INTRODUCTION: TOWARDS LESS CASUAL CAUSAL INFERENCES 


Causal Inference is an admittedly pretentious title for a book. A complex 
scientific task, causal inference relies on triangulating evidence from multiple 
sources and on the application of a variety of methodological approaches. No 
book can possibly provide a comprehensive description of all methodologies for 
causal inference across the sciences. The authors of any Causal Inference book 
will have to choose which aspects of causal inference methodology they want 
to emphasize. 

The title of this introduction reflects our own choices: a book that helps 
scientists—especially health and social scientists—generate and analyze data 
to make causal inferences that are explicit about both the causal question and 
the assumptions underlying the data analysis. Unfortunately, the scientific 
literature is plagued by studies in which the causal question is not explicitly 
stated and the investigators’ unverifiable assumptions are not declared. This 
casual attitude towards causal inference has led to a great deal of confusion. 
For example, it is not uncommon to find studies in which the effect estimates 
are hard to interpret because the data analysis methods cannot appropriately 
answer the causal question (were it explicitly stated) under the investigators’ 
assumptions (were they declared). 

In this book, we stress the need to take the causal question seriously enough 
to articulate it, and to delineate the separate roles of data and assumptions for 
causal inference. Once these foundations are in place, causal inferences become 
necessarily less casual, which helps prevent confusion. The book describes 
various data analysis approaches to estimate the causal effect of interest under 
a particular set of assumptions when data are collected on each individual in 
a population. A key message of the book is that causal inference cannot be 
reduced to a collection of recipes for data analysis. 

The book is divided into three parts of increasing difficulty: Part I is about 
causal inference without models (i.e., nonparametric identification of causal 
effects), Part II is about causal inference with models (i.e., estimation of causal 
effects with parametric models), and Part III is about causal inference from 
complex longitudinal data (i.e., estimation of causal effects of time-varying 
treatments). Throughout the text, we have interspersed Fine Points and Technical 
Points that elaborate on certain topics mentioned in the main text. Fine 
Points are designed to be accessible to all readers, while Technical Points are 
designed for readers with intermediate training in statistics. The book provides 
a cohesive presentation of concepts and methods for causal inference that are 
currently scattered across journals in several disciplines. We expect that it 
will be of interest to all professionals who make causal inferences, including 
epidemiologists, statisticians, psychologists, economists, sociologists, political 
scientists, computer scientists... 

This is not a philosophy book. We remain agnostic about metaphysical 
concepts like causality and cause. Instead, we focus on the identification and 
estimation of causal effects in populations, that is, numerical quantities that 
measure changes in the distribution of an outcome under different interven- 
tions. For example, we discuss how to estimate the risk of death in patients 
with serious heart failure if they received a heart transplant versus if they did 
not. Through actionable causal inference, we want to help decision makers 
make better decisions. 

We are grateful to many people who have made this book possible. Stephen 
Cole, Sander Greenland, Jay Kaufman, Eleanor Murray, Sonja Swanson, Tyler 
VanderWeele, and Jan Vandenbroucke provided detailed comments. Goodarz 
Danaei, Kosuke Kawai, Martin Lajous, and Kathleen Wirth helped create the 
NHEFS dataset. The sample code in Part II was developed by Roger Logan in 
SAS, Eleanor Murray and Roger Logan in Stata, Joy Shi and Sean McGrath in 
R, and James Fiedler in Python. Roger Logan has also been our LaTeX wizard. 
Randall Chaput helped create the figures in Chapters 1 and 2. Josh McKible 
designed the book cover. Rob Calver, our patient publisher, encouraged us to 
write the book and supported our decision to make it freely available online. 

In addition, multiple colleagues have helped us improve the book by detect- 
ing typos and identifying unclear passages. We especially thank Kafui Adjaye- 
Gbewonyo, Alvaro Alonso, Katherine Almendinger, Ingelise Andersen, Juan 
José Beunza, Karen Biala, Joanne Brady, Alex Breskin, Shan Cai, Yu-Han 
Chiu, Alexis Dinno, James Fiedler, Birgitte Frederiksen, Tadayoshi Fushiki, 
Leticia Grize, Dominik Hangartner, Michael Hudgens, John Jackson, Marshall 
Joffe, Luke Keele, Laura Khan, Dae Hyun Kim, Lauren Kunz, Martin Lajous, 
Angeliki Lambrou, Wen Wei Loh, Haidong Lu, Mohammad Ali Mansournia, 
Giovanni Marchetti, Lauren McCarl, Shira Mitchell, Louis Mittel, Hannah Oh, 
Ibironke Olofin, Robert Paige, Jeremy Pertman, Melinda Power, Bruce Psaty, 
Brian Sauer, Tomohiro Shinozaki, Ian Shrier, Yan Song, Øystein Sørensen, 
Etsuji Suzuki, Denis Talbot, Mohammad Tavakkoli, Sarah Taubman, Evan 
Thacker, Kun-Hsing Yu, Vera Zietemann, Helmut Wasserbacher, Jessica Young, 
and Dorith Zimmermann. 
Part I 
Causal inference without models 


Chapter 1 
A DEFINITION OF CAUSAL EFFECT 


As a human being, you are already innately familiar with causal inference’s fundamental concepts. Through 
sheer existence, you know what a causal effect is, understand the difference between association and causation, and 
you have used this knowledge consistently throughout your life. Had you not, you’d be dead. Without basic causal 
concepts, you would not have survived long enough to read this chapter, let alone learn to read. As a toddler, you 
would have jumped right into the swimming pool after seeing that those who did were later able to reach the jam jar. 
As a teenager, you would have skied down the most dangerous slopes after seeing that those who did won the next ski 
race. As a parent, you would have refused to give antibiotics to your sick child after observing that those children 
who took their medicines were not at the park the next day. 

Since you already understand the definition of causal effect and the difference between association and cau- 
sation, do not expect to gain deep conceptual insights from this chapter. Rather, the purpose of this chapter is 
to introduce mathematical notation that formalizes the causal intuition that you already possess. Make sure that 
you can match your causal intuition with the mathematical notation introduced here. This notation is necessary 
to precisely define causal concepts, and will be used throughout the book. 


1.1 Individual causal effects 


Zeus is a patient waiting for a heart transplant. On January 1, he receives a 
new heart. Five days later, he dies. Imagine that we can somehow know— 
perhaps by divine revelation—that had Zeus not received a heart transplant 
on January 1, he would have been alive five days later. Equipped with this 
information most would agree that the transplant caused Zeus’s death. The 
heart transplant intervention had a causal effect on Zeus’s five-day survival. 
Another patient, Hera, also received a heart transplant on January 1. Five 
days later she was alive. Imagine we can somehow know that, had Hera not 
received the heart on January 1, she would still have been alive five days later. 
Hence the transplant did not have a causal effect on Hera’s five-day survival. 
These two vignettes illustrate how humans reason about causal effects: 
We compare (usually only mentally) the outcome when an action A is taken 
versus the outcome when the action A is withheld. If the two outcomes differ, 
we say that the action A has a causal effect, causative or preventive, on the 
outcome. Otherwise, we say that the action A has no causal effect on the 
outcome. Epidemiologists, statisticians, economists, and other social scientists 
often refer to the action A as an intervention, an exposure, or a treatment. 

Capital letters represent random variables. Lower case letters denote 
particular values of a random variable. 

To make our causal intuition amenable to mathematical and statistical 
analysis we will introduce some notation. Consider a dichotomous treatment 
variable A (1: treated, 0: untreated) and a dichotomous outcome variable Y 
(1: death, 0: survival). In this book we refer to variables such as A and Y 
that may have different values for different individuals as random variables. 
Let $Y^{a=1}$ (read Y under treatment a = 1) be the outcome variable that would 
have been observed under the treatment value a = 1, and $Y^{a=0}$ (read Y under 
treatment a = 0) the outcome variable that would have been observed under 
the treatment value a = 0. $Y^{a=1}$ and $Y^{a=0}$ are also random variables. Zeus 
has $Y^{a=1} = 1$ and $Y^{a=0} = 0$ because he died when treated but would have 
survived if untreated, while Hera has $Y^{a=1} = 0$ and $Y^{a=0} = 0$ because she 
survived when treated and would also have survived if untreated. 

Sometimes we abbreviate the expression “individual $i$ has outcome 
$Y^a = 1$” by writing $Y_i^a = 1$. Technically, when $i$ refers to a specific 
individual, such as Zeus, $Y_i^a$ is not a random variable because we are 
assuming that individual counterfactual outcomes are deterministic 
(see Technical Point 1.2). 

We can now provide a formal definition of a causal effect for an individual: 
The treatment A has a causal effect on an individual’s outcome Y if 
$Y^{a=1} \neq Y^{a=0}$ for the individual. Thus, the treatment has a causal effect on 
Zeus’s outcome because $Y^{a=1} = 1 \neq 0 = Y^{a=0}$, but not on Hera’s outcome 
because $Y^{a=1} = 0 = Y^{a=0}$. The variables $Y^{a=1}$ and $Y^{a=0}$ are referred to 
as potential outcomes or as counterfactual outcomes. Some authors prefer the 
term “potential outcomes” to emphasize that, depending on the treatment that 
is received, either of these two outcomes can be potentially observed. Other 
authors prefer the term “counterfactual outcomes” to emphasize that these 
outcomes represent situations that may not actually occur (that is, 
counter-to-the-fact situations). 

Causal effect for individual $i$: $Y_i^{a=1} \neq Y_i^{a=0}$ 

For each individual, one of the counterfactual outcomes (the one that 
corresponds to the treatment value that the individual did receive) is actually 
factual. For example, because Zeus was actually treated (A = 1), his 
counterfactual outcome under treatment $Y^{a=1} = 1$ is equal to his observed (actual) 
outcome Y = 1. That is, an individual with observed treatment A equal to a 
has observed outcome Y equal to his counterfactual outcome $Y^a$. This equality 
can be succinctly expressed as $Y = Y^A$, where $Y^A$ denotes the counterfactual 
$Y^a$ evaluated at the value a corresponding to the individual’s observed 
treatment A. The equality $Y = Y^A$ is referred to as consistency. 

Consistency: if $A_i = a$, then $Y_i^a = Y^{A_i} = Y_i$ 

Individual causal effects are defined as a contrast of the values of 
counterfactual outcomes, but only one of those outcomes is observed for each 
individual: the one corresponding to the treatment value actually experienced by 
the individual. All other counterfactual outcomes remain unobserved. Because of 
missing data, individual effects cannot be identified, that is, they cannot be 
expressed as a function of the observed data. (See Fine Point 2.1 for a possible 
exception.) 


1.2 Average causal effects 


We needed three pieces of information to define an individual causal effect: an 
outcome of interest, the actions a = 1 and a = 0 to be compared, and the 
individual whose counterfactual outcomes $Y^{a=0}$ and $Y^{a=1}$ are to be compared. 
However, because identifying individual causal effects is generally not possible, 
we now turn our attention to an aggregated causal effect: the average causal 
effect in a population of individuals. To define it, we need three pieces of 
information: an outcome of interest, the actions a = 1 and a = 0 to be 
compared, and a well-defined population of individuals whose outcomes $Y^{a=0}$ 
and $Y^{a=1}$ are to be compared. 
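
The ingredients of an individual causal effect can be made concrete with the two vignettes of Section 1.1. The Python sketch below is illustrative only: the `Individual` class and its field names are hypothetical, and it assumes, as the vignettes do, that both counterfactual outcomes are somehow known. It checks whether treatment has an individual causal effect and uses consistency to pick out the one counterfactual outcome that is actually observed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    """Hypothetical record; in practice only one counterfactual outcome is observed."""
    name: str
    a: int     # observed treatment A (1: treated, 0: untreated)
    y_a0: int  # counterfactual outcome Y^{a=0} (1: death, 0: survival)
    y_a1: int  # counterfactual outcome Y^{a=1} (1: death, 0: survival)

    def has_individual_effect(self) -> bool:
        # Treatment has a causal effect on this individual if Y^{a=1} != Y^{a=0}.
        return self.y_a1 != self.y_a0

    def observed_outcome(self) -> int:
        # Consistency: if A = a, then the observed Y equals Y^a.
        return self.y_a1 if self.a == 1 else self.y_a0


# Values taken from the Zeus and Hera vignettes (both received a transplant).
zeus = Individual("Zeus", a=1, y_a0=0, y_a1=1)
hera = Individual("Hera", a=1, y_a0=0, y_a1=0)

for person in (zeus, hera):
    print(person.name, person.has_individual_effect(), person.observed_outcome())
```

Because only `observed_outcome()` is available from real data, `has_individual_effect()` cannot in general be evaluated, which is why the text turns to average effects.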

Take Zeus’s extended family as our population of interest. Table 1.1 shows 
the counterfactual outcomes under both treatment (a = 1) and no treatment 
(a = 0) for all 20 members of our population. Focus on the last column: the 
outcome $Y^{a=1}$ that would have been observed for each individual if they had 
received the treatment (a heart transplant). Half of the members of the popu- 
lation (10 out of 20) would have died if they had received a heart transplant. 
That is, the proportion of individuals that would have developed the outcome 
had all population individuals received a = 1 is $\Pr[Y^{a=1} = 1] = 10/20 = 0.5$. 
Similarly, from the other column of Table 1.1, we can conclude that half of 
Fine Point 1.1 


Interference. Our definition of a counterfactual outcome implicitly assumes that an individual’s counterfactual outcome 
under treatment value a does not depend on other individuals’ treatment values. For example, we implicitly assumed 
that Zeus would die if he received a heart transplant, regardless of whether Hera also received a heart transplant. That 
is, Hera’s treatment value did not interfere with Zeus’s outcome. On the other hand, suppose that Hera’s getting 
a new heart upsets Zeus to the extent that he would not survive his own heart transplant, even though he would 
have survived had Hera not been transplanted. In this scenario, Hera’s treatment interferes with Zeus’s outcome. 
Interference between individuals is common in studies that deal with contagious agents or educational programs, in 
which an individual’s outcome is influenced by their social interaction with other population members. 

In the presence of interference, the counterfactual $Y_i^a$ for an individual $i$ is not well defined because an individual’s 
outcome depends on other individuals’ treatment values. When there is interference, “the causal effect of heart transplant 
on Zeus’s outcome” is not well defined. Rather, one needs to refer to “the causal effect of heart transplant on Zeus’s 
outcome when Hera does not get a new heart” or “the causal effect of heart transplant on Zeus’s outcome when Hera 
does get a new heart.” If other relatives’ and friends’ treatments also interfere with Zeus’s outcome, then one may need 
to refer to the causal effect of heart transplant on Zeus's outcome when “no relative or friend gets a new heart,” “when 
only Hera gets a new heart,” etc. because the causal effect of treatment on Zeus’s outcome may differ for each particular 
allocation of hearts. The assumption of no interference was labeled “no interaction between units” by Cox (1958), and 
is included in the “stable-unit-treatment-value assumption (SUTVA)” described by Rubin (1980). See Halloran and 
Struchiner (1995), Sobel (2006), Rosenbaum (2007), and Hudgens and Halloran (2009) for a more detailed discussion 
of the role of interference in the definition of causal effects. Unless otherwise specified, we will assume no interference 
throughout this book. 








the members of the population (10 out of 20) would have died if they had not 
received a heart transplant. That is, the proportion of individuals that would 
have developed the outcome had all population individuals received a = 0 is 
$\Pr[Y^{a=0} = 1] = 10/20 = 0.5$. We have computed the counterfactual risk under 
treatment to be 0.5 by counting the number of deaths (10) and dividing them 
by the total number of individuals (20), which is the same as computing the 
average of the counterfactual outcomes across all individuals in the population. 
To see the equivalence between risk and average for a dichotomous outcome, 
use the data in Table 1.1 to compute the average of $Y^{a=1}$. 

Table 1.1 
              $Y^{a=0}$   $Y^{a=1}$ 
Rheia             0           1 
Kronos            1           0 
Demeter           0           0 
Hades             0           0 
Hestia            0           0 
Poseidon          1           0 
Hera              0           0 
Zeus              0           1 
Artemis           1           1 
Apollo            1           0 
Leto              0           1 
Ares              1           1 
Athena            1           1 
Hephaestus        0           1 
Aphrodite         0           1 
Cyclope           0           1 
Persephone        1           1 
Hermes            1           0 
Hebe              1           0 
Dionysus          1           0 

We are now ready to provide a formal definition of the average causal effect 
in the population: An average causal effect of treatment A on outcome Y 
is present if $\Pr[Y^{a=1} = 1] \neq \Pr[Y^{a=0} = 1]$ in the population of interest. 
Under this definition, treatment A does not have an average causal effect on 
outcome Y in our population because both the risk of death under treatment 
$\Pr[Y^{a=1} = 1]$ and the risk of death under no treatment $\Pr[Y^{a=0} = 1]$ are 
0.5. It does not matter whether all or none of the individuals receive a heart 
transplant: Half of them would die in either case. When, like here, the average 
causal effect in the population is null, we say that the null hypothesis of no 
average causal effect is true. Because the risk equals the average and because 
the letter E is usually employed to represent the population average or mean 
(also referred to as ‘E’xpectation), we can rewrite the definition of a non-null 
average causal effect in the population as $E[Y^{a=1}] \neq E[Y^{a=0}]$ so that the 
definition applies to both dichotomous and nondichotomous outcomes. 
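
The equivalence between risk and average can be checked numerically. The following Python sketch transcribes Table 1.1 (the variable names are illustrative, not part of the notation) and computes each counterfactual risk twice: once by counting deaths and once by averaging the 0/1 outcomes.

```python
# Rows of Table 1.1: (name, Y^{a=0}, Y^{a=1}); 1 = death, 0 = survival.
TABLE_1_1 = [
    ("Rheia", 0, 1), ("Kronos", 1, 0), ("Demeter", 0, 0), ("Hades", 0, 0),
    ("Hestia", 0, 0), ("Poseidon", 1, 0), ("Hera", 0, 0), ("Zeus", 0, 1),
    ("Artemis", 1, 1), ("Apollo", 1, 0), ("Leto", 0, 1), ("Ares", 1, 1),
    ("Athena", 1, 1), ("Hephaestus", 0, 1), ("Aphrodite", 0, 1),
    ("Cyclope", 0, 1), ("Persephone", 1, 1), ("Hermes", 1, 0),
    ("Hebe", 1, 0), ("Dionysus", 1, 0),
]

n = len(TABLE_1_1)
y_a0 = [row[1] for row in TABLE_1_1]
y_a1 = [row[2] for row in TABLE_1_1]

# Risk as a proportion: number of deaths divided by the population size.
risk_a0 = y_a0.count(1) / n
risk_a1 = y_a1.count(1) / n

# Risk as an average of the dichotomous counterfactual outcomes.
mean_a0 = sum(y_a0) / n
mean_a1 = sum(y_a1) / n

# An average causal effect is present if E[Y^{a=1}] != E[Y^{a=0}].
average_causal_effect_present = mean_a1 != mean_a0

print(risk_a0, risk_a1, mean_a0, mean_a1, average_causal_effect_present)
```

Both computations give 0.5 under each treatment value, so no average causal effect is present in this population.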


The presence of an “average causal effect of heart transplant A” is defined 
by a contrast that involves the two actions “receiving a heart transplant (a = 
1)” and “not receiving a heart transplant (a = 0).” When more than two 
actions are possible (i.e., the treatment is not dichotomous), the particular 
Fine Point 1.2 


Multiple versions of treatment. Our definition of a counterfactual outcome under treatment value a also implicitly 
assumes that there is only one version of treatment value A = a. For example, we said that Zeus would die if he 
received a heart transplant. This statement implicitly assumes that all heart transplants are performed by the same 
surgeon using the same procedure and equipment. That is, there is only one version of the treatment “heart transplant.” 
If there were multiple versions of treatment (e.g., surgeons with different skills), then it is possible that Zeus would 
survive if his transplant were performed by Asclepios, and would die if his transplant were performed by Hygieia. In 
the presence of multiple versions of treatment, the counterfactual $Y_i^a$ for an individual $i$ is not well defined because an 
individual's outcome depends on the version of treatment a. When there are multiple versions of treatment, “the causal 
effect of heart transplant on Zeus’s outcome” is not well defined. Rather, one needs to refer to “the causal effect of 
heart transplant on Zeus’s outcome when Asclepios performs the surgery” or “the causal effect of heart transplant on 
Zeus's outcome when Hygieia performs the surgery.” If other components of treatment (e.g., procedure, place) are also 
relevant to the outcome, then one may need to refer to “the causal effect of heart transplant on Zeus's outcome when 
Asclepios performs the surgery using his rod at the temple of Kos” because the causal effect of treatment on Zeus's 
outcome may differ for each particular version of treatment. 

Like the assumption of no interference (see Fine Point 1.1), the assumption of no multiple versions of treatment is 
included in the “stable-unit-treatment-value assumption (SUTVA)” described by Rubin (1980). Robins and Greenland 
(2000) made the point that if the versions of a particular treatment (e.g., heart transplant) had the same causal effect 
on the outcome (survival), then the counterfactual $Y^{a=1}$ would be well-defined. VanderWeele (2009) formalized this 
point as the assumption of “treatment variation irrelevance,” i.e., the assumption that multiple versions of treatment 
A = a may exist but they all result in the same outcome $Y_i^a$. We return to this issue in Chapter 3 but, unless otherwise 
specified, we will assume treatment variation irrelevance throughout this book. 


contrast of interest needs to be specified. For example, “the causal effect of 
aspirin” is meaningless unless we specify that the contrast of interest is, say, 
“taking, while alive, 150 mg of aspirin by mouth (or nasogastric tube if need be) 
daily for 5 years” versus “not taking aspirin.” This causal effect is well defined 
even if counterfactual outcomes under other interventions are not well defined 
or do not exist (e.g., “taking, while alive, 500 mg of aspirin by absorption 
through the skin daily for 5 years”). 

Average causal effect in population: $E[Y^{a=1}] \neq E[Y^{a=0}]$ 


Absence of an average causal effect does not imply absence of individual 
effects. Table 1.1 shows that treatment has an individual causal effect on 
12 members (including Zeus) of the population because, for each of these 12 
individuals, the values of their counterfactual outcomes $Y^{a=1}$ and $Y^{a=0}$ differ. 
Of the 12, 6 were harmed by treatment, including Zeus ($Y^{a=1} - Y^{a=0} = 1$), 
and 6 were helped ($Y^{a=1} - Y^{a=0} = -1$). This equality is not an accident: 
The average causal effect $E[Y^{a=1}] - E[Y^{a=0}]$ is always equal to the average 
$E[Y^{a=1} - Y^{a=0}]$ of the individual causal effects $Y^{a=1} - Y^{a=0}$, as a difference 
of averages is equal to the average of the differences. When there is no causal 
effect for any individual in the population, i.e., $Y^{a=1} = Y^{a=0}$ for all individuals, 
we say that the sharp causal null hypothesis is true. The sharp causal null 
hypothesis implies the null hypothesis of no average effect. 
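
The counts in this paragraph can be reproduced from Table 1.1. The sketch below (variable names are illustrative) computes each individual causal effect on the difference scale, tallies who is harmed and who is helped, and confirms that the difference of averages equals the average of the differences while the sharp causal null does not hold.

```python
# Counterfactual outcomes from Table 1.1, in table order (Rheia ... Dionysus).
y_a0 = [0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
y_a1 = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
n = len(y_a0)

# Individual causal effects Y^{a=1} - Y^{a=0}.
effects = [y1 - y0 for y0, y1 in zip(y_a0, y_a1)]

harmed = effects.count(1)       # would die only if treated
helped = effects.count(-1)      # would die only if untreated
unaffected = effects.count(0)   # same outcome under both treatment values

# E[Y^{a=1}] - E[Y^{a=0}] versus E[Y^{a=1} - Y^{a=0}].
difference_of_averages = sum(y_a1) / n - sum(y_a0) / n
average_of_differences = sum(effects) / n

sharp_null_holds = all(effect == 0 for effect in effects)
average_null_holds = difference_of_averages == 0

print(harmed, helped, unaffected)
print(difference_of_averages, average_of_differences)
print(sharp_null_holds, average_null_holds)
```

The average null holds here even though the sharp null fails, which illustrates that the implication runs in one direction only.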


As discussed in the next chapters, average causal effects can sometimes be 
identified from data, even if individual causal effects cannot. Hereafter we refer 
to ‘average causal effects’ simply as ‘causal effects’ and the null hypothesis of 
no average effect as the causal null hypothesis. We next describe different 
measures of the magnitude of a causal effect. 
Technical Point 1.1 


Causal effects in the population. Let $E[Y^a]$ be the mean counterfactual outcome had all individuals in the population 
received treatment level $a$. For discrete outcomes, the mean or expected value $E[Y^a]$ is defined as the weighted sum 
$\sum_y y\, p_{Y^a}(y)$ over all possible values $y$ of the random variable $Y^a$, where $p_{Y^a}(\cdot)$ is the probability mass function of $Y^a$, 
i.e., $p_{Y^a}(y) = \Pr[Y^a = y]$. For dichotomous outcomes, $E[Y^a] = \Pr[Y^a = 1]$. For continuous outcomes, the expected 
value $E[Y^a]$ is defined as the integral $\int y\, f_{Y^a}(y)\, dy$ over all possible values $y$ of the random variable $Y^a$, where $f_{Y^a}(\cdot)$ 
is the probability density function of $Y^a$. A common representation of the expected value that applies to both discrete 
and continuous outcomes is $E[Y^a] = \int y\, dF_{Y^a}(y)$, where $F_{Y^a}(\cdot)$ is the cumulative distribution function (CDF) of the 
random variable $Y^a$. We say that there is a non-null average causal effect in the population if $E[Y^a] \neq E[Y^{a'}]$ for any 
two values $a$ and $a'$. 

The average causal effect, defined by a contrast of means of counterfactual outcomes, is the most commonly 
used population causal effect. However, a population causal effect may also be defined as a contrast of functionals, 
including medians, variances, hazards, or CDFs of counterfactual outcomes. In general, a population causal effect can be 
defined as a contrast of any functional of the marginal distributions of counterfactual outcomes under different actions 
or treatment values. For example, the population causal effect on the variance is defined as $\mathrm{var}(Y^{a=1}) - \mathrm{var}(Y^{a=0})$, 
which is zero for the population in Table 1.1 since the distributions of $Y^{a=1}$ and $Y^{a=0}$ are identical, both having 10 
deaths out of 20. In fact, the equality of these distributions implies that for any functional (e.g., mean, variance, median, 
hazard, etc.), the population causal effect on the functional is zero. However, in contrast to the mean, the difference 
in population variances $\mathrm{var}(Y^{a=1}) - \mathrm{var}(Y^{a=0})$ does not in general equal the variance of the individual causal effects 
$\mathrm{var}(Y^{a=1} - Y^{a=0})$. For example, in Table 1.1, since $Y^{a=1} - Y^{a=0}$ is not constant (-1 for 6 individuals, 1 for 6 
individuals and 0 for 8 individuals), $\mathrm{var}(Y^{a=1} - Y^{a=0}) > 0 = \mathrm{var}(Y^{a=1}) - \mathrm{var}(Y^{a=0})$. We will be able to identify 
(i.e., compute) $\mathrm{var}(Y^{a=1}) - \mathrm{var}(Y^{a=0})$ from the data collected in a randomized trial, but not $\mathrm{var}(Y^{a=1} - Y^{a=0})$ 
because we can never simultaneously observe both $Y^{a=1}$ and $Y^{a=0}$ for any individual, and thus the covariance of $Y^{a=1}$ 
and $Y^{a=0}$ is not identified. The above discussion is true not only for the variance but for any nonlinear functional 
(e.g., median, hazard). 
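
The contrast between the two variance quantities can be verified on Table 1.1. This is a minimal sketch using Python's standard `statistics.pvariance` (population variance); the variable names are illustrative.

```python
from statistics import pvariance

# Counterfactual outcomes from Table 1.1, in table order (Rheia ... Dionysus).
y_a0 = [0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1]
y_a1 = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Population causal effect on the variance: a contrast of marginal functionals.
var_difference = pvariance(y_a1) - pvariance(y_a0)

# Variance of the individual causal effects: requires the joint distribution
# of (Y^{a=0}, Y^{a=1}), which is known here only because the table is hypothetical.
individual_effects = [y1 - y0 for y0, y1 in zip(y_a0, y_a1)]
var_of_effects = pvariance(individual_effects)

print(var_difference, var_of_effects)
```

The first quantity depends only on the two marginal distributions, while the second needs each individual's pair of outcomes, which is why only the first can be identified from a randomized trial.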


1.3 Measures of causal effect 


We have seen that the treatment ‘heart transplant’ A does not have a causal 
effect on the outcome ‘death’ Y in our population of 20 family members of 
Zeus. The causal null hypothesis holds because the two counterfactual risks 
$\Pr[Y^{a=1} = 1]$ and $\Pr[Y^{a=0} = 1]$ are equal to 0.5. There are equivalent ways 
of representing the causal null. For example, we could say that the risk 
$\Pr[Y^{a=1} = 1]$ minus the risk $\Pr[Y^{a=0} = 1]$ is zero (0.5 - 0.5 = 0) or that 
the risk $\Pr[Y^{a=1} = 1]$ divided by the risk $\Pr[Y^{a=0} = 1]$ is one (0.5/0.5 = 1). 

That is, we can represent the causal null by 

(i) $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] = 0$ 

(ii) $\dfrac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]} = 1$ 

(iii) $\dfrac{\Pr[Y^{a=1} = 1] / \Pr[Y^{a=1} = 0]}{\Pr[Y^{a=0} = 1] / \Pr[Y^{a=0} = 0]} = 1$ 

where the left-hand side of the equalities (i), (ii), and (iii) is the causal risk 
difference, risk ratio, and odds ratio, respectively. 

The causal risk difference in the population is the average of the individual 
causal effects $Y^{a=1} - Y^{a=0}$ on the difference scale, i.e., it is a measure of 
the average individual causal effect. By contrast, the causal risk ratio in the 
population is not the average of the individual causal effects $Y^{a=1} / Y^{a=0}$ on 
the ratio scale, i.e., it is a measure of causal effect in the population but is 
not the average of any individual causal effects. 

Suppose now that another treatment A, cigarette smoking, has a causal 
effect on another outcome Y, lung cancer, in our population. The causal null 
hypothesis does not hold: $\Pr[Y^{a=1} = 1]$ and $\Pr[Y^{a=0} = 1]$ are not equal. In 
this setting, the causal risk difference, risk ratio, and odds ratio are not 0, 1, 
Fine Point 1.3 


Number needed to treat. Consider a population of 100 million patients in which 20 million would die within five years 
if treated (a = 1), and 30 million would die within five years if untreated (a = 0). This information can be summarized 
in several equivalent ways: 


• the causal risk difference is $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1] = 0.2 - 0.3 = -0.1$ 


• if one treats the 100 million patients, there will be 10 million fewer deaths than if one does not treat those 100 
million patients 


• one needs to treat 100 million patients to save 10 million lives 


• on average, one needs to treat 10 patients to save 1 life 


We refer to the average number of individuals that need to receive treatment a = 1 to reduce the number of cases 
Y = 1 by one as the number needed to treat (NNT). In our example the NNT is equal to 10. For treatments that 
reduce the average number of cases (i.e., the causal risk difference is negative), the NNT is equal to the reciprocal of 
the absolute value of the causal risk difference: 


$\mathrm{NNT} = \dfrac{-1}{\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]}$ 





For treatments that increase the average number of cases (i.e., the causal risk difference is positive), one can 
symmetrically define the number needed to harm. The NNT was introduced by Laupacis, Sackett, and Roberts (1988). 
Like the causal risk difference, the NNT applies to the population and time interval on which it is based. For a discussion 
of the relative advantages and disadvantages of the NNT as an effect measure, see Grieve (2003). 
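
The arithmetic of this Fine Point can be sketched in a few lines of Python. The function name `number_needed_to_treat` is hypothetical, and exact fractions are used only to avoid floating-point rounding in the illustration.

```python
from fractions import Fraction


def number_needed_to_treat(risk_treated: Fraction, risk_untreated: Fraction) -> Fraction:
    """NNT = -1 / (Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1]) for a beneficial treatment."""
    risk_difference = risk_treated - risk_untreated
    if risk_difference >= 0:
        # For a positive risk difference one would define the number needed to harm.
        raise ValueError("NNT as defined here requires a negative causal risk difference")
    return -1 / risk_difference


# 20 of 100 million would die if treated, 30 of 100 million if untreated.
nnt = number_needed_to_treat(Fraction(20, 100), Fraction(30, 100))
print(nnt)
```

Raising an error for a non-negative risk difference is a design choice of this sketch; it mirrors the restriction of the NNT to treatments that reduce the average number of cases.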


and 1, respectively. Rather, these causal parameters quantify the strength of 
the same causal effect on different scales. Because the causal risk difference, 
risk ratio, and odds ratio (and other summaries) measure the causal effect, we 
refer to them as effect measures. 

Each effect measure may be used for different purposes. For example, 
imagine a large population in which 3 in a million individuals would develop the 
outcome if treated, and 1 in a million individuals would develop the outcome if 
untreated. The causal risk ratio is 3, and the causal risk difference is 0.000002. 
The causal risk ratio (multiplicative scale) is used to compute how many times 
treatment, relative to no treatment, increases the disease risk. The causal risk 
difference (additive scale) is used to compute the absolute number of cases of 
the disease attributable to the treatment. The use of either the multiplicative 
or additive scale will depend on the goal of the inference. 
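
As an illustration of how the same pair of counterfactual risks yields different numbers on different scales, the sketch below computes the three effect measures defined in this section. The helper `effect_measures` is a hypothetical name, and the function assumes both risks lie strictly between 0 and 1 so that the ratios are defined.

```python
from fractions import Fraction


def effect_measures(risk_treated: Fraction, risk_untreated: Fraction) -> dict:
    """Causal risk difference, risk ratio, and odds ratio from two counterfactual risks."""
    odds_treated = risk_treated / (1 - risk_treated)
    odds_untreated = risk_untreated / (1 - risk_untreated)
    return {
        "risk_difference": risk_treated - risk_untreated,  # additive scale
        "risk_ratio": risk_treated / risk_untreated,       # multiplicative scale
        "odds_ratio": odds_treated / odds_untreated,
    }


# Causal null in Table 1.1: both counterfactual risks are 10/20.
null_case = effect_measures(Fraction(10, 20), Fraction(10, 20))

# Rare outcome: 3 per million if treated, 1 per million if untreated.
rare_case = effect_measures(Fraction(3, 10**6), Fraction(1, 10**6))

print({name: float(value) for name, value in null_case.items()})
print({name: float(value) for name, value in rare_case.items()})
```

In the rare-outcome example the risk ratio is 3 while the risk difference is only 2 per million, which is why the choice of scale should follow the goal of the inference.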


1.4 Random variability 


At this point you could complain that our procedure to compute effect measures 
is somewhat implausible. Not only did we ignore the well known fact that the 
immortal Zeus cannot die, but—more to the point—our population in Table 
1.1 had only 20 individuals. The populations of interest are typically much 
larger. 

In our tiny population, we collected information from all the individuals. In 
practice, investigators only collect information on a sample of the population 
of interest. Even if the counterfactual outcomes of all study individuals were 
known, working with samples prevents one from obtaining the exact proportion 
of individuals in the population who had the outcome under treatment value 
a, e.g., the probability of death under no treatment $\Pr[Y^{a=0} = 1]$ cannot be 
directly computed. One can only estimate this probability. 

1st source of random error: 
Sampling variability 

An estimator $\hat{\theta}$ of $\theta$ is consistent 
if, with probability approaching 1, 
the difference $\hat{\theta} - \theta$ approaches zero 
as the sample size increases towards 
infinity. 

Caution: the term ‘consistency’ 
when applied to estimators has a 
different meaning from that which 
it has when applied to counterfactual 
outcomes. 

2nd source of random error: 
Nondeterministic counterfactuals 

Consider the individuals in Table 1.1. We have previously viewed them 
as forming a twenty-person population. Suppose we view them as a random 
sample from a much larger, near-infinite super-population (e.g., all immortals). 
We denote the proportion of individuals in the sample who would have 
died if unexposed as $\widehat{\Pr}[Y^{a=0} = 1] = 10/20 = 0.50$. The sample proportion 
$\widehat{\Pr}[Y^{a=0} = 1]$ does not have to be exactly equal to the proportion of individuals 
who would have died if the entire super-population had been unexposed, 
$\Pr[Y^{a=0} = 1]$. For example, suppose $\Pr[Y^{a=0} = 1] = 0.57$ in the population 
but, because of random error due to sampling variability, $\widehat{\Pr}[Y^{a=0} = 1] = 0.5$ in 
our particular sample. We use the sample proportion $\widehat{\Pr}[Y^a = 1]$ to estimate 
the super-population probability $\Pr[Y^a = 1]$ under treatment value a. The 
“hat” over Pr indicates that the sample proportion $\widehat{\Pr}[Y^a = 1]$ is an estimator 
of the corresponding population quantity $\Pr[Y^a = 1]$. We say that $\widehat{\Pr}[Y^a = 1]$ 
is a consistent estimator of $\Pr[Y^a = 1]$ because the larger the number of individuals 
in the sample, the smaller the difference between $\widehat{\Pr}[Y^a = 1]$ and 
$\Pr[Y^a = 1]$ is expected to be. This occurs because the error due to sampling 
variability is random and thus obeys the law of large numbers. 
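
Sampling variability and the consistency of the sample proportion can be illustrated by simulation. The sketch below is an assumption-laden toy: it treats the super-population as an endless supply of independent draws with $\Pr[Y^{a=0} = 1] = 0.57$ (the value used in the example above), and the function name, seed, and sample sizes are arbitrary choices.

```python
import random


def sample_proportion(true_risk: float, n: int, rng: random.Random) -> float:
    """Proportion of deaths among n individuals sampled from the super-population."""
    deaths = sum(1 for _ in range(n) if rng.random() < true_risk)
    return deaths / n


TRUE_RISK = 0.57            # Pr[Y^{a=0} = 1] in the hypothetical super-population
rng = random.Random(2024)   # fixed seed so that the sketch is reproducible

for n in (20, 2_000, 200_000):
    estimate = sample_proportion(TRUE_RISK, n, rng)
    # The absolute error is expected to shrink as the sample size grows.
    print(n, round(estimate, 4), round(abs(estimate - TRUE_RISK), 4))
```

With 20 individuals an estimate of 0.5 is unremarkable; with hundreds of thousands the estimate is expected to sit very close to 0.57.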

Because the super-population probabilities $\Pr[Y^a = 1]$ cannot be computed, 
only consistently estimated by the sample proportions $\widehat{\Pr}[Y^a = 1]$, one cannot 
conclude with certainty that there is, or there is not, a causal effect. Rather, a 
statistical procedure must be used to evaluate the empirical evidence regarding 
the causal null hypothesis $\Pr[Y^{a=1} = 1] = \Pr[Y^{a=0} = 1]$ (see Chapter 10 for 
details). 

So far we have only considered sampling variability as a source of random 
error. But there may be another source of random variability: perhaps the 
values of an individual’s counterfactual outcomes are not fixed in advance. 
We have defined the counterfactual outcome Y^a as the individual’s outcome
had he received treatment value a. For example, in our first vignette, Zeus 
would have died if treated and would have survived if untreated. As defined, 
the values of the counterfactual outcomes are fixed or deterministic for each 
individual, e.g., Y^{a=1} = 1 and Y^{a=0} = 0 for Zeus. In other words, Zeus
has a 100% chance of dying if treated and a 0% chance of dying if untreated. 
However, we could imagine another scenario in which Zeus has a 90% chance 
of dying if treated, and a 10% chance of dying if untreated. In this scenario, 
the counterfactual outcomes are stochastic or nondeterministic because Zeus’s 
probabilities of dying under treatment (0.9) and under no treatment (0.1) 
are neither zero nor one. The values of Y^{a=1} and Y^{a=0} shown in Table 1.1
would be possible realizations of “random flips of mortality coins” with these 
probabilities. Further, one would expect that these probabilities vary across 
individuals because not all individuals are equally susceptible to develop the 
outcome. Quantum mechanics, in contrast to classical mechanics, holds that 
outcomes are inherently nondeterministic. That is, if the quantum mechanical 
probability of Zeus dying is 90%, the theory holds that no matter how much 
data we collect about Zeus, the uncertainty about whether Zeus will actually 
develop the outcome if treated is irreducible. 

Technical Point 1.2

Nondeterministic counterfactuals. For nondeterministic counterfactual outcomes, the mean outcome under treatment
value a, E[Y^a], equals the weighted sum Σ_y y p_{Y^a}(y) over all possible values y of the random variable Y^a, where the
probability mass function p_{Y^a}(·) = E[Q_{Y^a}(·)], and Q_{Y^a}(y) is a random probability of having outcome Y = y under
treatment level a. In the example described in the text, Q_{Y^{a=1}}(1) = 0.9 for Zeus. (For continuous outcomes, the
weighted sum is replaced by an integral.)

More generally, a nondeterministic definition of counterfactual outcome does not attach some particular value
of the random variable Y^a to each individual, but rather an individual-specific statistical distribution Θ_{Y^a}(·) of Y^a.
The nondeterministic definition of causal effect is a generalization of the deterministic definition in which Θ_{Y^a}(·) is
now a random CDF that may take values between 0 and 1. The average counterfactual outcome in the population
E[Y^a] equals E{E[Y^a | Θ_{Y^a}(·)]}. Therefore, E[Y^a] = E[∫ y dΘ_{Y^a}(y)] = ∫ y dE[Θ_{Y^a}(y)] = ∫ y dF_{Y^a}(y), where
F_{Y^a}(·) = E[Θ_{Y^a}(·)].

If the counterfactual outcomes are binary and nondeterministic, the causal risk ratio in the population
E[Q_{Y^{a=1}}(1)] / E[Q_{Y^{a=0}}(1)] is equal to the weighted average E[W {Q_{Y^{a=1}}(1) / Q_{Y^{a=0}}(1)}] of the individual
causal effects Q_{Y^{a=1}}(1) / Q_{Y^{a=0}}(1) on the ratio scale, with weights W = Q_{Y^{a=0}}(1) / E[Q_{Y^{a=0}}(1)], provided
Q_{Y^{a=0}}(1) is never equal to 0 (i.e., deterministic) for anyone in the population.

Thus, in causal inference, random error derives from sampling variability,
nondeterministic counterfactuals, or both. However, for pedagogic reasons, we
will continue to largely ignore random error until Chapter 10. Specifically, we
will assume that counterfactual outcomes are deterministic and that we have
recorded data on every individual in a very large (perhaps hypothetical) super-population.
This is equivalent to viewing our population of 20 individuals as a
population of 20 billion individuals in which 1 billion individuals are identical
to Zeus, 1 billion individuals are identical to Hera, and so on. Hence, until
Chapter 10, we will carry out our computations with Olympian certainty.

Then, in Chapter 10, we will describe how our statistical estimates and 
confidence intervals for causal effects in the super-population are identical ir- 
respective of whether the world is stochastic (quantum) or deterministic (classi- 
cal) at the level of individuals. In contrast, confidence intervals for the average 
causal effect in the actual study sample will differ depending on whether coun- 
terfactuals are deterministic versus stochastic. Fortunately, super-population 
effects are in most cases the causal effects of substantive interest. 
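
The weighted-average identity for the causal risk ratio stated in Technical Point 1.2 can be checked numerically. The sketch below assumes a hypothetical three-person population; the individual risks in the list are invented for illustration (only the 0.9 and 0.1 of the first entry echo the Zeus scenario in the text), and exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction as F

# Hypothetical individual risks (Q_{Y^{a=1}}(1), Q_{Y^{a=0}}(1)); illustrative only.
risks = [(F(9, 10), F(1, 10)), (F(1, 2), F(1, 4)), (F(3, 10), F(3, 5))]
n = len(risks)

mean_q1 = sum(q1 for q1, _ in risks) / n   # E[Q_{Y^{a=1}}(1)]
mean_q0 = sum(q0 for _, q0 in risks) / n   # E[Q_{Y^{a=0}}(1)]
population_rr = mean_q1 / mean_q0          # causal risk ratio in the population

# Weighted average of individual ratios with weights W = Q0 / E[Q0].
weighted_avg = sum((q0 / mean_q0) * (q1 / q0) for q1, q0 in risks) / n

# Unweighted average of individual ratios, shown for contrast.
unweighted_avg = sum(q1 / q0 for q1, q0 in risks) / n

print(population_rr, weighted_avg, unweighted_avg)
```

The first two printed values coincide, while the unweighted average of the individual ratios generally differs from the population risk ratio.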


1.5 Causation versus association 


Obviously, the data available from actual studies look different from those 
shown in Table 1.1. For example, we would not usually expect to learn Zeus’s 
outcome if treated Y^{a=1} and also Zeus’s outcome if untreated Y^{a=0}. In the
real world, we only get to observe one of those outcomes because Zeus is either 
treated or untreated. We referred to the observed outcome as Y. Thus, for 
each individual, we know the observed treatment level A and the outcome Y 
as in Table 1.2. 

The data in Table 1.2 can be used to compute the proportion of individuals 
that developed the outcome Y among those individuals in the population that 
happened to receive treatment value a. For example, in Table 1.2, 7 individuals
died (Y = 1) among the 13 individuals that were treated (A = 1). Thus the
risk of death in the treated, Pr[Y = 1|A = 1], was 7/13. More generally, the
conditional probability Pr[Y = 1|A = a] is defined as the proportion of individuals
that developed the outcome Y among those individuals in the population
of interest that happened to receive treatment value a.

Table 1.2

            A  Y
Rheia       0  0
Kronos      0  1
Demeter     0  0
Hades       0  0
Hestia      1  0
Poseidon    1  0
Hera        1  0
Zeus        1  1
Artemis     0  1
Apollo      0  1
Leto        0  0
Ares        1  1
Athena      1  1
Hephaestus  1  1
Aphrodite   1  1
Cyclope     1  1
Persephone  1  1
Hermes      1  0
Hebe        1  0
Dionysus    1  0

Dawid (1979) introduced the symbol ⊥⊥ to denote independence.

For a continuous outcome Y we define mean independence between
treatment and outcome as: E[Y|A = 1] = E[Y|A = 0]. Independence and
mean independence are the same concept for dichotomous outcomes.


When the proportion of individuals who develop the outcome in the treated 
Pr[Y = 1|A = 1] equals the proportion of individuals who develop the outcome 
in the untreated Pr[Y = 1|A = 0], we say that treatment A and outcome Y 
are independent, that A is not associated with Y, or that A does not predict 
Y. Independence is represented by Y ⊥⊥ A (or, equivalently, A ⊥⊥ Y), which is
read as Y and A are independent. Some equivalent definitions of independence
are

(i) Pr[Y = 1|A = 1] - Pr[Y = 1|A = 0] = 0

(ii) Pr[Y = 1|A = 1] / Pr[Y = 1|A = 0] = 1

(iii) (Pr[Y = 1|A = 1] / Pr[Y = 0|A = 1]) / (Pr[Y = 1|A = 0] / Pr[Y = 0|A = 0]) = 1

where the left-hand side of the equalities (i), (ii), and (iii) is the associational
risk difference, risk ratio, and odds ratio, respectively.


We say that treatment A and outcome Y are dependent or associated when 
Pr[Y = 1|A = 1] ≠ Pr[Y = 1|A = 0]. In our population, treatment and
outcome are indeed associated because Pr[Y = 1|A = 1] = 7/13 and Pr[Y = 
1|A = 0] = 3/7. The associational risk difference, risk ratio, and odds ratio 
(and other measures) quantify the strength of the association when it exists. 
They measure the association on different scales, and we refer to them as 
association measures. These measures are also affected by random variability. 
However, until Chapter 10, we will disregard statistical issues by assuming that 
the population in Table 1.2 is extremely large. 
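
As a worked example, the three association measures can be computed from the two risks quoted in the text, 7/13 in the treated and 3/7 in the untreated. This is a minimal sketch; the variable names and the helper odds are choices made here, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

risk_treated = Fraction(7, 13)    # Pr[Y = 1|A = 1] from the text
risk_untreated = Fraction(3, 7)   # Pr[Y = 1|A = 0] from the text


def odds(p):
    """Convert a risk p into the odds p / (1 - p)."""
    return p / (1 - p)


risk_difference = risk_treated - risk_untreated      # 10/91
risk_ratio = risk_treated / risk_untreated           # 49/39
odds_ratio = odds(risk_treated) / odds(risk_untreated)  # 14/9

print(f"risk difference = {risk_difference} ({float(risk_difference):.3f})")
print(f"risk ratio      = {risk_ratio} ({float(risk_ratio):.3f})")
print(f"odds ratio      = {odds_ratio} ({float(odds_ratio):.3f})")
```

None of the three measures takes its null value (0 for the difference, 1 for the ratios), which is another way of saying that treatment and outcome are associated in this population.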








For dichotomous outcomes, the risk equals the average in the population, 
and we can therefore rewrite the definition of association in the population as 
E[Y|A = 1] ≠ E[Y|A = 0]. For continuous outcomes Y, we will also define
association as E[Y|A = 1] ≠ E[Y|A = 0]. For binary A, Y and A are not
associated if and only if they are not statistically correlated.


In our population of 20 individuals, we found (i) no causal effect after com- 
paring the risk of death if all 20 individuals had been treated with the risk 
of death if all 20 individuals had been untreated, and (ii) an association after 
comparing the risk of death in the 13 individuals who happened to be treated 
with the risk of death in the 7 individuals who happened to be untreated. 
Figure 1.1 depicts the causation-association difference. The population (repre- 
sented by a diamond) is divided into a white area (the treated) and a smaller 
grey area (the untreated). 
Figure 1.1
Population of interest: Treated, Untreated
Causation: E[Y^{a=1}] vs. E[Y^{a=0}]
Association: E[Y|A = 1] vs. E[Y|A = 0]

The difference between association and causation is critical. Suppose
the causal risk ratio of 5-year mortality is 0.5 for aspirin vs. no aspirin,
and the corresponding associational risk ratio is 1.5 because
individuals at high risk of cardiovascular death are preferentially prescribed
aspirin. After a physician learns these results, she decides to
withhold aspirin from her patients because those treated with aspirin
have a greater risk of dying compared with the untreated. The doctor
will be sued for malpractice.


The definition of causation implies a contrast between the whole white 
diamond (all individuals treated) and the whole grey diamond (all individu- 
als untreated), whereas association implies a contrast between the white (the 
treated) and the grey (the untreated) areas of the original diamond. That is, 
inferences about causation are concerned with what if questions in counterfac- 
tual worlds, such as “what would be the risk if everybody had been treated?” 
and “what would be the risk if everybody had been untreated?”, whereas inferences
about association are concerned with questions in the actual world, such
as “what is the risk in the treated?” and “what is the risk in the untreated?” 

We can use the notation we have developed thus far to formalize this distinction
between causation and association. The risk Pr[Y = 1|A = a] is a
conditional probability: the risk of Y in the subset of the population that
meets the condition ‘having actually received treatment value a’ (i.e., A = a).
In contrast the risk Pr[Y^a = 1] is an unconditional (also known as marginal)
probability, the risk of Y^a in the entire population. Therefore, association is
defined by a different risk in two disjoint subsets of the population determined
by the individuals’ actual treatment value (A = 1 or A = 0), whereas causation
is defined by a different risk in the same population under two different
treatment values (a = 1 or a = 0). Throughout this book we often use the
redundant expression ‘causal effect’ to avoid confusions with a common use of
‘effect’ meaning simply association.

These radically different definitions explain the well-known adage “asso- 
ciation is not causation.” In our population, there was association because 
the mortality risk in the treated (7/13) was greater than that in the untreated 
(3/7). However, there was no causation because the risk if everybody had been 
treated (10/20) was the same as the risk if everybody had been untreated. This 
discrepancy between causation and association would not be surprising if those 
who received heart transplants were, on average, sicker than those who did not 
receive a transplant. In Chapter 7 we refer to this discrepancy as confounding. 

Causal inference requires data like the hypothetical data in Table 1.1, but 
all we can ever expect to have is real world data like those in Table 1.2. The 
question is then under which conditions real world data can be used for causal 
inference. The next chapter provides one answer: conduct a randomized ex- 
periment. 


Chapter 2 
RANDOMIZED EXPERIMENTS 


Does your looking up at the sky make other pedestrians look up too? This question has the main components 
of any causal question: we want to know whether an action (your looking up) affects an outcome (other people’s 
looking up) in a specific population (say, residents of Madrid in 2019). Suppose we challenge you to design a 
scientific study to answer this question. “Not much of a challenge,” you say after some thought, “I can stand on 
the sidewalk and flip a coin whenever someone approaches. If heads, I’ll look up; if tails, I’ll look straight ahead.
I’ll repeat the experiment a few thousand times. If the proportion of pedestrians who looked up within 10 seconds
after I did is greater than the proportion of pedestrians who looked up when I didn’t, I will conclude that my 
looking up has a causal effect on other people’s looking up. By the way, I may hire an assistant to record what 
people do while I’m looking up.” After conducting this study, you found that 55% of pedestrians looked up when 
you looked up but only 1% looked up when you looked straight ahead. 

Your solution to our challenge was to conduct a randomized experiment. It was an experiment because the 
investigator (you) carried out the action of interest (looking up), and it was randomized because the decision to 
act on any study subject (pedestrian) was made by a random device (coin flipping). Not all experiments are 
randomized. For example, you could have looked up when a man approached and looked straight ahead when a 
woman did. Then the assignment of the action would have followed a deterministic rule (up for man, straight for 
woman) rather than a random mechanism. However, your findings would not have been nearly as convincing if 
you had conducted a nonrandomized experiment. If your action had been determined by the pedestrian’s sex,
critics could argue that the “looking up” behavior of men and women differs (women may look up less often than 
do men after you look up) and thus your study compared essentially “noncomparable” groups of people. This 
chapter describes why randomization results in convincing causal inferences. 


2.1 Randomization 


In a real world study we will not know both of Zeus’s potential outcomes Y^{a=1}
under treatment and Y^{a=0} under no treatment. Rather, we can only know
his observed outcome Y under the treatment value A that he happened to
receive. Table 2.1 summarizes the available information for our population
of 20 individuals. Only one of the two counterfactual outcomes is known for
each individual: the one corresponding to the treatment level that he actually
received. The data are missing for the other counterfactual outcomes. As we
discussed in the previous chapter, this missing data creates a problem because
it appears that we need the value of both counterfactual outcomes to compute
effect measures. The data in Table 2.1 are only good to compute association
measures.

Neyman (1923) applied counterfactual theory to the estimation of
causal effects via randomized experiments.

Randomized experiments, like any other real world study, generate data with 
missing values of the counterfactual outcomes as shown in Table 2.1. However, 
randomization ensures that those missing values occurred by chance. As a 
result, effect measures can be computed (or, more rigorously, consistently
estimated) in randomized experiments despite the missing data. Let us be
more precise. 
Exchangeability: Y^a ⊥⊥ A for all a

Suppose that the population represented by a diamond in Figure 1.1 was
near-infinite, and that we flipped a coin for each individual in such population.
We assigned the individual to the white group if the coin turned tails, and
to the grey group if it turned heads. Note this was not a fair coin because 
the probability of heads was less than 50%—fewer people ended up in the 
grey group than in the white group. Next we asked our research assistants to 
administer the treatment of interest (A = 1), to individuals in the white group 
and a placebo (A = 0) to those in the grey group. Five days later, at the end of 
the study, we computed the mortality risks in each group, Pr[Y = 1|A = 1] = 
0.3 and Pr[Y = 1|A = 0] = 0.6. The associational risk ratio was 0.3/0.6 = 0.5 
and the associational risk difference was 0.3 - 0.6 = -0.3. We will assume
that this was an ideal randomized experiment in all other respects: no loss to 
follow-up, full adherence to the assigned treatment over the duration of the 
study, a single version of treatment, and double blind assignment (see Chapter 
9). Ideal randomized experiments are unrealistic but useful to introduce some 
key concepts for causal inference. Later in this book we consider more realistic 
randomized experiments. 

Now imagine what would have happened if the research assistants had 
misinterpreted our instructions and had treated the grey group rather than 
the white group. Say we learned of the misunderstanding after the study 
finished. How does this reversal of treatment status affect our conclusions? 
Not at all. We would still find that the risk in the treated (now the grey 
group) Pr[Y = 1|A = 1] is 0.3 and the risk in the untreated (now the white 
group) Pr[Y = 1|A = 0] is 0.6. The association measure would not change. 
Because individuals were randomly assigned to white and grey groups, the 
proportion of deaths among the exposed, Pr[Y = 1|A = 1] is expected to be 
the same whether individuals in the white group received the treatment and 
individuals in the grey group received placebo, or vice versa. When group 
membership is randomized, which particular group received the treatment is 
irrelevant for the value of Pr[Y = 1|A = 1]. The same reasoning applies to 
Pr[Y = 1|A = 0], of course. Formally, we say that groups are exchangeable. 

Exchangeability means that the risk of death in the white group would have 
been the same as the risk of death in the grey group had individuals in the white 
group received the treatment given to those in the grey group. That is, the risk 
under the potential treatment value a among the treated, Pr[Y^a = 1|A = 1],
equals the risk under the potential treatment value a among the untreated,
Pr[Y^a = 1|A = 0], for both a = 0 and a = 1. An obvious consequence of these
(conditional) risks being equal in all subsets defined by treatment status in the
population is that they must be equal to the (marginal) risk under treatment
value a in the whole population: Pr[Y^a = 1|A = 1] = Pr[Y^a = 1|A = 0] =
Pr[Y^a = 1]. Because the counterfactual risk under treatment value a is the
same in both groups A = 1 and A = 0, we say that the actual treatment A
does not predict the counterfactual outcome Y^a. Equivalently, exchangeability
means that the counterfactual outcome and the actual treatment are independent,
or Y^a ⊥⊥ A, for all values a. Randomization is so highly valued because it
is expected to produce exchangeability. When the treated and the untreated 
are exchangeable, we sometimes say that treatment is exogenous, and thus 
exogeneity is commonly used as a synonym for exchangeability. 

The previous paragraph argues that, in the presence of exchangeability, the 
counterfactual risk under treatment in the white part of the population would 
equal the counterfactual risk under treatment in the entire population. But the 
risk under treatment in the white group is not counterfactual at all because the 
white group was actually treated! Therefore our ideal randomized experiment 
allows us to compute the counterfactual risk under treatment in the population 
Pr[Y^{a=1} = 1] because it is equal to the risk in the treated Pr[Y = 1|A = 1] =
0.3. That is, the risk in the treated (the white part of the diamond) is the
same as the risk if everybody had been treated (and thus the diamond had
been entirely white). Of course, the same rationale applies to the untreated:
the counterfactual risk under no treatment in the population Pr[Y^{a=0} = 1]
equals the risk in the untreated Pr[Y = 1|A = 0] = 0.6. The causal risk ratio
is 0.5 and the causal risk difference is -0.3. In ideal randomized experiments,
association is causation.

Technical Point 2.1

Full exchangeability and mean exchangeability. Randomization makes the Y^a jointly independent of A which implies,
but is not implied by, exchangeability Y^a ⊥⊥ A for each a. Formally, let \mathcal{A} = {a, a', a'', ...} denote the set of all treatment
values present in the population, and Y^{\mathcal{A}} = {Y^a, Y^{a'}, Y^{a''}, ...} the set of all counterfactual outcomes. Randomization
makes Y^{\mathcal{A}} ⊥⊥ A. We refer to this joint independence as full exchangeability. For a dichotomous treatment, \mathcal{A} = {0, 1}
and full exchangeability is (Y^{a=1}, Y^{a=0}) ⊥⊥ A.

For a dichotomous outcome and treatment, exchangeability Y^a ⊥⊥ A can also be written as Pr[Y^a = 1|A = 1] =
Pr[Y^a = 1|A = 0] or, equivalently, as E[Y^a|A = 1] = E[Y^a|A = 0] for all a. We refer to the last equality as mean
exchangeability. For a continuous outcome, exchangeability Y^a ⊥⊥ A implies mean exchangeability E[Y^a|A = a'] =
E[Y^a], but mean exchangeability does not imply exchangeability because distributional parameters other than the mean
(e.g., variance) may not be independent of treatment.

Neither full exchangeability Y^{\mathcal{A}} ⊥⊥ A nor exchangeability Y^a ⊥⊥ A are required to prove that E[Y^a] = E[Y|A = a].
Mean exchangeability is sufficient. As sketched in the main text, the proof has two steps. First, E[Y|A = a] =
E[Y^a|A = a] by consistency. Second, E[Y^a|A = a] = E[Y^a] by mean exchangeability. Because exchangeability and
mean exchangeability are identical concepts for the dichotomous outcomes used in this chapter, we use the shorter term
“exchangeability” throughout.
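
The ideal randomized experiment of this section can be mimicked with a short simulation. The sketch below rests on several assumptions that are not in the text: a finite population of 200,000 individuals stands in for the near-infinite one, the two counterfactual outcomes are drawn independently with risks 0.3 and 0.6, and the coin lands tails (white group) with probability 0.6, since the text only says that the grey group is the smaller one.

```python
import random

rng = random.Random(42)   # fixed seed, used only for reproducibility
N = 200_000               # assumed stand-in for a near-infinite population
P_WHITE = 0.6             # assumed probability of tails (white group)

# Counterfactual outcomes (Y^{a=1}, Y^{a=0}), fixed before treatment assignment.
people = [(rng.random() < 0.3, rng.random() < 0.6) for _ in range(N)]
white = [rng.random() < P_WHITE for _ in range(N)]   # coin flip per individual


def observed_risks(treated_flags):
    """Return (Pr[Y=1|A=1], Pr[Y=1|A=0]) using consistency, Y = Y^A."""
    deaths = {1: 0, 0: 0}
    counts = {1: 0, 0: 0}
    for (y1, y0), flag in zip(people, treated_flags):
        a = int(flag)
        counts[a] += 1
        deaths[a] += y1 if a else y0
    return deaths[1] / counts[1], deaths[0] / counts[0]


as_planned = observed_risks(white)                 # white group treated
swapped = observed_risks([not w for w in white])   # grey group treated by mistake

print("white treated: ", as_planned)
print("grey treated:  ", swapped)
```

Under these assumptions both assignments should give risks close to 0.3 in the treated and 0.6 in the untreated, which is the practical meaning of saying that the white and grey groups are exchangeable.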

Here is another explanation for exchangeability Y^a ⊥⊥ A in a randomized
experiment. The counterfactual outcome Y^a, like one’s genetic make-up, can
be thought of as a fixed characteristic of a person existing before the treatment
A was randomly assigned. This is because Y^a encodes what would have
been one’s outcome if assigned to treatment a and thus does not depend on
the treatment you later receive. Because treatment A was randomized, it is
independent of both your genes and Y^a. The difference between Y^a and your
genetic make-up is that, even conceptually, you can only learn the value of Y^a
after treatment is given and then only if one’s treatment A is equal to a.

Caution: Y^a ⊥⊥ A is different from Y ⊥⊥ A

Before proceeding, please make sure you understand the difference between
Y^a ⊥⊥ A and Y ⊥⊥ A. Exchangeability Y^a ⊥⊥ A is defined as independence between
the counterfactual outcome and the observed treatment. Again, this means
that the treated and the untreated would have experienced the same risk of
death if they had received the same treatment level (either a = 0 or a = 1). But
independence between the counterfactual outcome and the observed treatment
Y^a ⊥⊥ A does not imply independence between the observed outcome and the
observed treatment Y ⊥⊥ A. For example, in a randomized experiment in which
exchangeability Y^a ⊥⊥ A holds and the treatment has a causal effect on the
outcome, then Y ⊥⊥ A does not hold because the treatment is associated with
the observed outcome.

Suppose there is a causal effect on some individuals so that Y^{a=1} ≠ Y^{a=0}.
Since Y = Y^A, then Y^a with a evaluated at the observed treatment A is the
observed Y^A, which depends on A and thus will not be independent of A.

Fine Point 2.1

Crossover experiments. Suppose we want to estimate the individual causal effect of lightning bolt use A on Zeus’s
blood pressure Y. We define the counterfactual outcomes Y^{a=1} and Y^{a=0} to be 1 if Zeus’s blood pressure is temporarily
elevated after calling or not calling a lightning strike, respectively. Suppose we convinced Zeus to use his lightning bolt
only when suggested by us. Yesterday morning we asked Zeus to call a lightning strike (a = 1). His blood pressure was
elevated after doing so. This morning we asked Zeus to refrain from using his lightning bolt (a = 0). His blood pressure
did not increase. We have conducted a crossover experiment in which an individual’s outcome is sequentially observed
under two treatment values. One might argue that, because we have observed both of Zeus’s counterfactual outcomes
Y^{a=1} = 1 and Y^{a=0} = 0, using a lightning bolt has a causal effect on Zeus’s blood pressure. However, we now show
that this argument would generally be incorrect unless the very strong assumptions i)-iii) given in the next paragraph are
true.

In crossover experiments, individuals are observed during two or more periods, say t = 0 and t = 1. An individual
i receives a different treatment value A_{it} in each period t. Let Y_{i1}^{a_0 a_1} be the (deterministic) counterfactual outcome at
t = 1 for individual i if treated with a_1 at t = 1 and a_0 at t = 0. Let Y_{i0}^{a_0} be defined similarly for t = 0. The individual
causal effect Y_{it}^{a_t=1} - Y_{it}^{a_t=0} can be identified if the following three conditions hold: i) no carryover effect of treatment:
Y_{i1}^{a_0 a_1} = Y_{i1}^{a_1}, ii) the individual causal effect does not depend on time: Y_{it}^{a_t=1} - Y_{it}^{a_t=0} = α_i for t = 0, 1, and iii) the
counterfactual outcome under no treatment does not depend on time: Y_{it}^{a_t=0} = β_i for t = 0, 1. Under these conditions,
if the individual is treated at time 1 (A_{i1} = 1) but not time 0 (A_{i0} = 0) then, by consistency, Y_{i1} - Y_{i0} is the individual
causal effect because Y_{i1} - Y_{i0} = Y_{i1}^{a_1=1} - Y_{i0}^{a_0=0} = Y_{i1}^{a_1=1} - Y_{i1}^{a_1=0} + Y_{i1}^{a_1=0} - Y_{i0}^{a_0=0} = α_i + β_i - β_i = α_i. Similarly
if A_{i1} = 0 and A_{i0} = 1, Y_{i0} - Y_{i1} = α_i is the individual level causal effect.

Condition (i) implies that the outcome Y_{it}^{a_t} has an abrupt onset that completely resolves by the next time period.
Hence, crossover experiments cannot be used to study the effect of heart transplant, an irreversible action, on death,
an irreversible outcome. See also Fine Point 3.2.

Does exchangeability hold in our heart transplant study of Table 2.1? To
answer this question we would need to check whether Y^a ⊥⊥ A holds for a = 0
and for a = 1. Take a = 0 first. Suppose the counterfactual data in Table 1.1
are available to us. We can then compute the risk of death under no treatment
Pr[Y^{a=0} = 1|A = 1] = 7/13 in the 13 treated individuals and the risk of death
under no treatment Pr[Y^{a=0} = 1|A = 0] = 3/7 in the 7 untreated individuals.
Since the risk of death under no treatment is greater in the treated than in the
untreated individuals, i.e., 7/13 > 3/7, we conclude that the treated have a
worse prognosis than the untreated, that is, that the treated and the untreated
are not exchangeable. Mathematically, we have proven that exchangeability
Y^a ⊥⊥ A does not hold for a = 0. (You can check that it does not hold for a = 1
either.) Thus the answer to the question that opened this paragraph is ‘No’.


But only the observed data in Table 2.1, not the counterfactual data in 
Table 1.1, are available in the real world. Since Table 2.1 is insufficient to 
compute counterfactual risks like the risk under no treatment in the treated 
Pr[Y^{a=0} = 1|A = 1], we are generally unable to determine whether exchangeability
holds in our study. However, suppose for a moment, that we actually
had access to Table 1.1 and determined that exchangeability does not hold 
in our heart transplant study. Can we then conclude that our study is not 
a randomized experiment? No, for two reasons. First, as you are probably 
already thinking, a twenty-person study is too small to reach definite conclu- 
sions. Random fluctuations arising from sampling variability could explain 
almost anything. We will discuss random variability in Chapter 10. Until 
then, let us assume that each individual in our population represents 1 billion 
individuals that are identical to him or her. Second, it is still possible that 
a study is a randomized experiment even if exchangeability does not hold in 
infinite samples. However, unlike the type of randomized experiment described 
in this section, it would need to be a randomized experiment in which investi- 
gators use more than one coin to randomly assign treatment. The next section 
describes randomized experiments with more than one coin. 
2.2 Conditional randomization


Table 2.2 shows the data from our heart transplant randomized study. Besides 
data on treatment A (1 if the individual received a transplant, 0 otherwise) 
and outcome Y (1 if the individual died, 0 otherwise), Table 2.2 also contains 
data on the prognostic factor L (1 if the individual was in critical condition, 
0 otherwise), which we measured before treatment was assigned. We now 
consider two mutually exclusive study designs and discuss whether the data in 
Table 2.2 could have arisen from either of them. 
Table 2.2

            L  A  Y
Rheia       0  0  0
Kronos      0  0  1
Demeter     0  0  0
Hades       0  0  0
Hestia      0  1  0
Poseidon    0  1  0
Hera        0  1  0
Zeus        0  1  1
Artemis     1  0  1
Apollo      1  0  1
Leto        1  0  0
Ares        1  1  1
Athena      1  1  1
Hephaestus  1  1  1
Aphrodite   1  1  1
Cyclope     1  1  1
Persephone  1  1  1
Hermes      1  1  0
Hebe        1  1  0
Dionysus    1  1  0

In design 1 we would have randomly selected 65% of the individuals in the
population and transplanted a new heart to each of the selected individuals.
That would explain why 13 out of 20 individuals were treated. In design 2
we would have classified all individuals as being in either critical (L = 1)
or noncritical (L = 0) condition. Then we would have randomly selected
75% of the individuals in critical condition and 50% of those in noncritical
condition, and transplanted a new heart to each of the selected individuals.
That would explain why 9 out of 12 individuals in critical condition, and 4 out
of 8 individuals in noncritical condition, were treated.

Both designs are randomized experiments. Design 1 is precisely the type of
randomized experiment described in Section 2.1. Under this design, we would
use a single coin to assign treatment to all individuals (e.g., treated if tails,
untreated if heads): a loaded coin with probability 0.65 of turning tails, thus
resulting in 65% of the individuals receiving treatment. Under design 2 we
would not use a single coin for all individuals. Rather, we would use a coin
with a 0.75 chance of turning tails for individuals in critical condition, and
another coin with a 0.50 chance of turning tails for individuals in noncritical
condition. We refer to design 2 experiments as conditionally randomized
experiments because we use several randomization probabilities that depend (are
conditional) on the values of the variable L. We refer to design 1 experiments
as marginally randomized experiments because we use a single unconditional
(marginal) randomization probability that is common to all individuals.

As discussed in the previous section, a marginally randomized experiment
is expected to result in exchangeability of the treated and the untreated:

Pr[Y^a = 1|A = 1] = Pr[Y^a = 1|A = 0] or Y^a ⊥⊥ A for all a.


In contrast, a conditionally randomized experiment will not generally result 
in exchangeability of the treated and the untreated because, by design, each 
group may have a different proportion of individuals with bad prognosis. 
Thus the data in Table 2.2 could not have arisen from a marginally random- 
ized experiment because 69% treated versus 43% untreated individuals were 
in critical condition. This imbalance indicates that the risk of death in the 
treated, had they remained untreated, would have been higher than the risk 
of death in the untreated. That is, treatment A predicts the counterfactual 
risk of death under no treatment, and exchangeability Y^a ⊥⊥ A does not hold.
Since our study was a randomized experiment, you can safely conclude that 
the study was a randomized experiment with randomization conditional on L. 
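
The percentages quoted for Table 2.2 can be reproduced directly from its rows. The sketch below transcribes the table as (name, L, A, Y) tuples; the helper proportion and the variable names are choices made here for illustration.

```python
from fractions import Fraction

# Rows (name, L, A, Y) transcribed from Table 2.2.
TABLE_2_2 = [
    ("Rheia", 0, 0, 0), ("Kronos", 0, 0, 1), ("Demeter", 0, 0, 0),
    ("Hades", 0, 0, 0), ("Hestia", 0, 1, 0), ("Poseidon", 0, 1, 0),
    ("Hera", 0, 1, 0), ("Zeus", 0, 1, 1), ("Artemis", 1, 0, 1),
    ("Apollo", 1, 0, 1), ("Leto", 1, 0, 0), ("Ares", 1, 1, 1),
    ("Athena", 1, 1, 1), ("Hephaestus", 1, 1, 1), ("Aphrodite", 1, 1, 1),
    ("Cyclope", 1, 1, 1), ("Persephone", 1, 1, 1), ("Hermes", 1, 1, 0),
    ("Hebe", 1, 1, 0), ("Dionysus", 1, 1, 0),
]


def proportion(rows, flag_index):
    """Fraction of rows whose 0/1 entry at flag_index equals 1."""
    return Fraction(sum(r[flag_index] for r in rows), len(rows))


treated = [r for r in TABLE_2_2 if r[2] == 1]
untreated = [r for r in TABLE_2_2 if r[2] == 0]
critical = [r for r in TABLE_2_2 if r[1] == 1]
noncritical = [r for r in TABLE_2_2 if r[1] == 0]

critical_among_treated = proportion(treated, 1)         # Pr[L = 1|A = 1]
critical_among_untreated = proportion(untreated, 1)     # Pr[L = 1|A = 0]
treated_among_critical = proportion(critical, 2)        # Pr[A = 1|L = 1]
treated_among_noncritical = proportion(noncritical, 2)  # Pr[A = 1|L = 0]

print(f"critical among treated:    {float(critical_among_treated):.0%}")
print(f"critical among untreated:  {float(critical_among_untreated):.0%}")
print(f"treated among critical:    {float(treated_among_critical):.0%}")
print(f"treated among noncritical: {float(treated_among_noncritical):.0%}")
```

The first two proportions show the imbalance in L between the treated and the untreated; the last two are the stratum-specific treatment proportions that match the two coins of design 2.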
Our conditionally randomized experiment is simply the combination of two
separate marginally randomized experiments: one conducted in the subset of
individuals in critical condition ($L = 1$), the other in the subset of
individuals in noncritical condition ($L = 0$). Consider first the randomized
experiment being conducted in the subset of individuals in critical condition.
In this subset, the treated and the untreated are exchangeable. Formally, the
counterfactual mortality risk under each treatment value $a$ is the same among
the treated and the untreated given that they all were in critical condition
at the time of treatment assignment. That is,

$\Pr[Y^a = 1 \mid A = 1, L = 1] = \Pr[Y^a = 1 \mid A = 0, L = 1]$ or $Y^a \perp\!\!\!\perp A \mid L = 1$ for all $a$,

where $Y^a \perp\!\!\!\perp A \mid L = 1$ means $Y^a$ and $A$ are independent given $L = 1$.
Similarly, randomization also ensures that the treated and the untreated are
exchangeable in the subset of individuals that were in noncritical condition,
that is, $Y^a \perp\!\!\!\perp A \mid L = 0$. When $Y^a \perp\!\!\!\perp A \mid L = l$ holds for all
values $l$ we simply write $Y^a \perp\!\!\!\perp A \mid L$. Thus, although conditional
randomization does not guarantee unconditional (or marginal) exchangeability
$Y^a \perp\!\!\!\perp A$, it guarantees conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$
within levels of the variable $L$. In summary, randomization produces either
marginal exchangeability (design 1) or conditional exchangeability (design 2).

Conditional exchangeability: $Y^a \perp\!\!\!\perp A \mid L$ for all $a$

If $A = 1$, then $Y^{a=0}$ is missing data and if $A = 0$, then $Y^{a=1}$ is missing
data. Data are missing completely at random (MCAR) if
$\Pr[A = a \mid L, Y^{a=1}, Y^{a=0}] = \Pr[A = a]$, which holds in a marginally
randomized experiment. Data are missing at random (MAR) if the probability of
$A = a$ conditional on the full data $(L, Y^{a=1}, Y^{a=0})$ only depends on the data
that would be observed $(L, Y^a)$ if $A = a$, that is,
$\Pr[A = a \mid L, Y^{a=1}, Y^{a=0}] = \Pr[A = a \mid L, Y^a]$, which holds in a
conditionally randomized experiment. The terms MCAR, MAR, and NMAR (not missing
at random) were introduced by Rubin (1976).

Stratification and effect modification are discussed in more detail in
Chapter 4.














We know how to compute effect measures under marginal exchangeability. In
marginally randomized experiments the causal risk ratio
$\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$ equals the associational risk ratio
$\Pr[Y = 1 \mid A = 1] / \Pr[Y = 1 \mid A = 0]$ because exchangeability ensures that
the counterfactual risk under treatment level $a$, $\Pr[Y^a = 1]$, equals the
observed risk among those who received treatment level $a$, $\Pr[Y = 1 \mid A = a]$.
Thus, if the data in Table 2.2 had been collected during a marginally
randomized experiment, the causal risk ratio would be readily calculated from
the data on $A$ and $Y$ as $\frac{7/13}{3/7} \approx 1.26$. The question is how to
compute the causal risk ratio in a conditionally randomized experiment.
Remember that a conditionally randomized experiment is simply the combination
of two (or more) separate marginally randomized experiments conducted in
different subsets of the population, e.g., $L = 1$ and $L = 0$. Thus we have two
options.


First, we can compute the average causal effect in each of these subsets or
strata of the population. Because association is causation within each subset,
the stratum-specific causal risk ratio
$\Pr[Y^{a=1} = 1 \mid L = 1] / \Pr[Y^{a=0} = 1 \mid L = 1]$ among people in critical
condition is equal to the stratum-specific associational risk ratio
$\Pr[Y = 1 \mid L = 1, A = 1] / \Pr[Y = 1 \mid L = 1, A = 0]$ among people in critical
condition. And analogously for $L = 0$. We refer to this method to compute
stratum-specific causal effects as stratification. Note that the
stratum-specific causal risk ratio in the subset $L = 1$ may differ from the
causal risk ratio in $L = 0$. In that case, we say that the effect of treatment
is modified by $L$, or that there is effect modification by $L$.
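
As a minimal sketch of stratification, the following Python snippet computes the crude associational risk ratio and the two stratum-specific risk ratios from the cell counts of Table 2.2. The `cells` dictionary and the `risk` helper are hypothetical names introduced here for illustration.

```python
from fractions import Fraction

# (L, A) -> (number of individuals, number of deaths), read off Table 2.2.
# The encoding is an assumption of this sketch.
cells = {(0, 0): (4, 1), (0, 1): (4, 1), (1, 0): (3, 2), (1, 1): (9, 6)}

def risk(a, l=None):
    """Observed risk Pr[Y=1 | A=a] or, if l is given, Pr[Y=1 | L=l, A=a]."""
    strata = [l] if l is not None else [0, 1]
    n = sum(cells[(s, a)][0] for s in strata)
    deaths = sum(cells[(s, a)][1] for s in strata)
    return Fraction(deaths, n)

# Crude (unstratified) associational risk ratio: (7/13) / (3/7).
crude_rr = risk(1) / risk(0)

# Stratum-specific associational risk ratios, one per level of L.
rr_by_stratum = {l: risk(1, l) / risk(0, l) for l in (0, 1)}

print(float(crude_rr), rr_by_stratum)
```

In these data both stratum-specific ratios equal 1 while the crude ratio is about 1.26, so there is no effect modification by L on the risk ratio scale in this example.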


Second, we can compute the average causal effect $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$
in the entire population, as we have been doing so far. Whether our principal
interest lies in the stratum-specific average causal effects versus the average
causal effect in the entire population depends on practical and theoretical con- 
siderations discussed in detail in Chapter 4 and in Part III. As one example, 
you may be interested in the average causal effect in the entire population, 
rather than in the stratum-specific average causal effects, if you do not expect 
to have information on L for future individuals (e.g., the variable L is expen- 
sive to measure) and thus your decision to treat cannot depend on the value of 
L. Until Chapter 4, we will restrict our attention to the average causal effect 
in the entire population. The next two sections describe how to use data from 
conditionally randomized experiments to compute the average causal effect in 
the entire population. 


2.3 Standardization 


Standardized mean: $\sum_l E[Y \mid L = l, A = a] \times \Pr[L = l]$

Our heart transplant study is a conditionally randomized experiment: the
investigators used a random procedure to assign hearts ($A = 1$) with probability
50% to the 8 individuals in noncritical condition ($L = 0$), and with probability
75% to the 12 individuals in critical condition ($L = 1$). First, let us focus on
the 8 individuals (remember, they are really the average representatives of 8
billion individuals) in noncritical condition. In this group, the risk of death
among the treated is $\Pr[Y = 1 \mid L = 0, A = 1] = \frac{1}{4}$ and the risk of death
among the untreated is $\Pr[Y = 1 \mid L = 0, A = 0] = \frac{1}{4}$. Because treatment
was randomly assigned to individuals in the group $L = 0$, i.e.,
$Y^a \perp\!\!\!\perp A \mid L = 0$, the observed risks are equal to the counterfactual
risks. That is, in the group $L = 0$, the risk in the treated equals the risk if
everybody had been treated,
$\Pr[Y = 1 \mid L = 0, A = 1] = \Pr[Y^{a=1} = 1 \mid L = 0]$, and the risk in the
untreated equals the risk if everybody had been untreated,
$\Pr[Y = 1 \mid L = 0, A = 0] = \Pr[Y^{a=0} = 1 \mid L = 0]$. Following a similar
reasoning, we can conclude that the observed risks equal the counterfactual
risks in the group of 12 individuals in critical condition, i.e.,
$\Pr[Y = 1 \mid L = 1, A = 1] = \Pr[Y^{a=1} = 1 \mid L = 1] = \frac{2}{3}$, and
$\Pr[Y = 1 \mid L = 1, A = 0] = \Pr[Y^{a=0} = 1 \mid L = 1] = \frac{2}{3}$.

Suppose now that our goal is to compute the causal risk ratio
$\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$. The numerator of the causal risk ratio is the
risk if all 20 individuals in the population had been treated. From the
previous paragraph, we know that the risk if all individuals had been treated
is $\frac{1}{4}$ in the 8 individuals with $L = 0$ and $\frac{2}{3}$ in the 12
individuals with $L = 1$. Therefore the risk if all 20 individuals in the
population had been treated will be a weighted average of $\frac{1}{4}$ and
$\frac{2}{3}$ in which each group receives a weight proportional to its size.
Since 40% of the individuals (8) are in group $L = 0$ and 60% of the individuals
(12) are in group $L = 1$, the weighted average is
$\frac{1}{4} \times 0.4 + \frac{2}{3} \times 0.6 = 0.5$. Thus the risk if everybody
had been treated $\Pr[Y^{a=1} = 1]$ is equal to 0.5. By following the same
reasoning we can calculate that the risk if nobody had been treated
$\Pr[Y^{a=0} = 1]$ is also equal to 0.5. The causal risk ratio is then
$0.5/0.5 = 1$.

More formally, the marginal counterfactual risk $\Pr[Y^a = 1]$ is the weighted
average of the stratum-specific risks $\Pr[Y^a = 1 \mid L = 0]$ and
$\Pr[Y^a = 1 \mid L = 1]$ with weights equal to the proportion of individuals in the
population with $L = 0$ and $L = 1$, respectively. That is,
$\Pr[Y^a = 1] = \Pr[Y^a = 1 \mid L = 0] \Pr[L = 0] + \Pr[Y^a = 1 \mid L = 1] \Pr[L = 1]$.
Or, using a more compact notation,
$\Pr[Y^a = 1] = \sum_l \Pr[Y^a = 1 \mid L = l] \Pr[L = l]$, where $\sum_l$ means sum
over all values $l$ that occur in the population. By conditional
exchangeability, we can replace the counterfactual risk $\Pr[Y^a = 1 \mid L = l]$ by
the observed risk $\Pr[Y = 1 \mid L = l, A = a]$ in the expression above. That is,
$\Pr[Y^a = 1] = \sum_l \Pr[Y = 1 \mid L = l, A = a] \Pr[L = l]$. The left-hand side of
this equality is an unobserved counterfactual risk whereas the right-hand side
includes observed quantities only, which can be computed using data on $L$, $A$,
and $Y$. When, as here, a counterfactual quantity can be expressed as a function
of the distribution (i.e., probabilities) of the observed data, we say that
the counterfactual quantity is identified or identifiable; otherwise, we say
it is unidentified or not identifiable.

The method described above is known in epidemiology, demography, and 
other disciplines as standardization. For example, the numerator
$\sum_l \Pr[Y = 1 \mid L = l, A = 1] \Pr[L = l]$ of the causal risk ratio is the standardized risk in the
treated using the population as the standard. In the presence of conditional ex- 
changeability, this standardized risk can be interpreted as the (counterfactual) 
risk that would have been observed had all the individuals in the population 
been treated. 
The standardized risks in the treated and the untreated are equal to the
counterfactual risks under treatment and no treatment, respectively.
Therefore, the causal risk ratio can be computed by standardization as

$\frac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]} = \frac{\sum_l \Pr[Y = 1 \mid L = l, A = 1] \Pr[L = l]}{\sum_l \Pr[Y = 1 \mid L = l, A = 0] \Pr[L = l]}$.
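
The standardization formula can be evaluated directly on the heart transplant data. The sketch below is illustrative only: `cells` is our own encoding of the counts in Table 2.2, and `standardized_risk` is a hypothetical helper name.

```python
from fractions import Fraction

# (L, A) -> (number of individuals, number of deaths), read off Table 2.2.
cells = {(0, 0): (4, 1), (0, 1): (4, 1), (1, 0): (3, 2), (1, 1): (9, 6)}

def standardized_risk(a):
    """sum_l Pr[Y=1 | L=l, A=a] * Pr[L=l], using the whole population as standard."""
    total = sum(n for n, _ in cells.values())
    risk = Fraction(0)
    for l in (0, 1):
        n_l = cells[(l, 0)][0] + cells[(l, 1)][0]   # stratum size
        n_la, deaths_la = cells[(l, a)]             # observed cell for (l, a)
        risk += Fraction(deaths_la, n_la) * Fraction(n_l, total)
    return risk

risk_treated = standardized_risk(1)
risk_untreated = standardized_risk(0)
causal_rr = risk_treated / risk_untreated
print(risk_treated, risk_untreated, causal_rr)
```

The result only has a causal interpretation under conditional exchangeability given L; the arithmetic itself makes no such assumption.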





2.4 Inverse probability weighting 


Figure 2.1 is an example of a fully randomized causally interpreted
structured tree graph or FRCISTG (Robins 1986, 1987) representation of a
conditionally randomized experiment. Did we win the prize for the worst
acronym ever?


Figure 2.1 


In the previous section we computed the causal risk ratio in a conditionally 
randomized experiment via standardization. In this section we compute this 
causal risk ratio via inverse probability weighting. The data in Table 2.2 
can be displayed as a tree in which all 20 individuals start at the left and 
progress over time towards the right, as in Figure 2.1. The leftmost circle of 
the tree contains its first branching: 8 individuals were in noncritical
condition (L = 0) and 12 in critical condition (L = 1). The numbers in parentheses
are the probabilities of being in noncritical, Pr [L = 0] = 8/20 = 0.4, or
critical, Pr [L = 1] = 12/20 = 0.6, condition. Let us follow, for example, the
branch L = 0. Of the 8 individuals in this branch, 4 were untreated (A = 0) 
and 4 were treated (A = 1). The conditional probability of being untreated 
is Pr [A = 0|L = 0] = 4/8 = 0.5, as shown in parentheses. The conditional 
probability of being treated Pr [A = 1|L = 0] is 0.5 too. The upper right circle 
represents that, of the 4 individuals in the branch (L = 0, A = 0), 3 survived 
(Y = 0) and 1 died (Y = 1). That is, Pr[Y = 0|L = 0, A = 0] = 3/4 and 
Pr [Y = 1|L = 0, A = 0] = 1/4. The other branches of the tree are interpreted 
analogously. The circles contain the bifurcations defined by non treatment 
variables. We now use this tree to compute the causal risk ratio. 
Fine Point 2.2


Risk periods. We have defined a risk as the proportion of individuals who develop the outcome of interest during a 
particular period. For example, the 5-day mortality risk in the treated Pr[Y = 1|A = 1] is the proportion of treated 
individuals who died during the first five days of follow-up. Throughout the book we often specify the period when the 
risk is first defined (e.g., 5 days) and, for conciseness, omit it later. That is, we may just say “the mortality risk” rather 
than “the five-day mortality risk.” 

The following example highlights the importance of specifying the risk period. Suppose a randomized experiment 
was conducted to quantify the causal effect of antibiotic therapy on mortality among elderly humans infected with the 
plague bacteria. An investigator analyzes the data and concludes that the causal risk ratio is 0.05, i.e., on average 
antibiotics decrease mortality by 95%. A second investigator also analyzes the data but concludes that the causal risk 
ratio is 1, i.e., antibiotics have a null average causal effect on mortality. Both investigators are correct. The first 
investigator computed the ratio of 1-year risks, whereas the second investigator computed the ratio of 100-year risks. 
The 100-year risk was of course 1 regardless of whether individuals received the treatment. When we say that a treatment 
has a causal effect on mortality, we mean that death is delayed, not prevented, by the treatment. 








Figure 2.2





The denominator of the causal risk ratio, $\Pr[Y^{a=0} = 1]$, is the
counterfactual risk of death had everybody in the population remained
untreated. Let us calculate this risk. In Figure 2.1, 4 out of 8 individuals
with $L = 0$ were untreated, and 1 of them died. How many deaths would have
occurred had the 8 individuals with $L = 0$ remained untreated? Two deaths,
because if 8 individuals rather than 4 individuals had remained untreated,
then 2 deaths rather than 1 death would have been observed. If the number of
individuals is multiplied by 2, then the number of deaths is also doubled. In
Figure 2.1, 3 out of 12 individuals with $L = 1$ were untreated, and 2 of them
died. How many deaths would have occurred had the 12 individuals with $L = 1$
remained untreated? Eight deaths, or 2 deaths times 4, because 12 is
$3 \times 4$. That is, if all $8 + 12 = 20$ individuals in the population had been
untreated, then $2 + 8 = 10$ would have died. The denominator of the causal risk
ratio, $\Pr[Y^{a=0} = 1]$, is $10/20 = 0.5$. The first tree in Figure 2.2 shows the
population had everybody remained untreated. Of course, these calculations
rely on the condition that treated individuals with $L = 0$, had they remained
untreated, would have had the same probability of death as those who actually
remained untreated. This condition is precisely exchangeability given $L = 0$.
The numerator of the causal risk ratio $\Pr[Y^{a=1} = 1]$ is the counterfactual
risk of death had everybody in the population been treated. Reasoning as in
the previous paragraph, this risk is calculated to be also $10/20 = 0.5$, under
exchangeability given $L = 1$. The second tree in Figure 2.2 shows the
population had everybody been treated. Combining the results from this and the
previous paragraph, the causal risk ratio $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$ is
equal to $0.5/0.5 = 1$. We are done.
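
The counting argument of the last two paragraphs can be written as a short Python sketch. The `cells` dictionary encodes the branches of Figure 2.1 and `deaths_if_all` is a hypothetical helper; both are introduced here only for illustration.

```python
from fractions import Fraction

# (L, A) -> (number of individuals, number of deaths), as in Figure 2.1.
cells = {(0, 0): (4, 1), (0, 1): (4, 1), (1, 0): (3, 2), (1, 1): (9, 6)}
N = sum(n for n, _ in cells.values())

def deaths_if_all(a):
    """Deaths expected had all individuals received treatment value a."""
    total_deaths = Fraction(0)
    for l in (0, 1):
        n_l = cells[(l, 0)][0] + cells[(l, 1)][0]
        n_la, deaths_la = cells[(l, a)]
        # Scale the observed branch up to the full stratum,
        # e.g. 8/4 = 2 for (L=0, A=0) and 12/3 = 4 for (L=1, A=0).
        scale = Fraction(n_l, n_la)
        total_deaths += deaths_la * scale
    return total_deaths

risk_untreated = deaths_if_all(0) / N
risk_treated = deaths_if_all(1) / N
print(deaths_if_all(0), deaths_if_all(1), risk_treated / risk_untreated)
```

The scaling step is where exchangeability within levels of L is used: it assumes the unobserved branch would have had the same risk as the observed one.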

Let us examine how this method works. The two trees in Figure 2.2 are 
a simulation of what would have happened had all individuals in the popula- 
tion been untreated and treated, respectively. These simulations are correct 
under conditional exchangeability. Both simulations can be pooled to create a 
hypothetical population in which every individual appears as a treated and as 
an untreated individual. This hypothetical population, twice as large as the 
original population, is known as the pseudo-population. Figure 2.3 shows the 
entire pseudo-population. Under conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ in the
original population, the treated and the untreated are (unconditionally) exchangeable
in the pseudo-population because L is independent of A. That
is, the associational risk ratio in the pseudo-population is equal to the causal 
risk ratio in both the pseudo-population and the original population. 


Figure 2.3 (pseudo-population; the IP weights $W^A = 1/f(A \mid L)$ shown are
1/0.5 = 2, 1/0.25 = 4, and 1/0.75 = 1.33)



This method is known as inverse probability (IP) weighting. To see why,
let us look at, say, the 4 untreated individuals with $L = 0$ in the population
of Figure 2.1. These individuals are used to create 8 members of the
pseudo-population of Figure 2.3. That is, each of them receives a weight of 2,
which is equal to 1/0.5. Figure 2.1 shows that 0.5 is the conditional
probability of staying untreated given $L = 0$. Similarly, the 9 treated
individuals with $L = 1$ in Figure 2.1 are used to create 12 members of the
pseudo-population. That is, each of them receives a weight of 1.33 = 1/0.75.
Figure 2.1 shows that 0.75 is the conditional probability of being treated
given $L = 1$. Informally, the pseudo-population is created by weighting each
individual in the population by the inverse of the conditional probability of
receiving the treatment level that she indeed received. These IP weights are
shown in Figure 2.3.

IP weighted estimators were proposed by Horvitz and Thompson (1952) for
surveys in which subjects are sampled with unequal probabilities.

IP weight: $W^A = 1/f[A \mid L]$

Technical Point 2.2


Formal definition of IP weights. An individual's IP weight depends on her
values of treatment $A$ and covariate $L$. For example, a treated individual with
$L = l$ receives the weight $1/\Pr[A = 1 \mid L = l]$, whereas an untreated individual
with $L = l'$ receives the weight $1/\Pr[A = 0 \mid L = l']$. We can express these
weights using a single expression for all individuals, regardless of their
individual treatment and covariate values, by using the probability density
function (PDF) of $A$ rather than the probability of $A$. The conditional PDF of $A$
given $L$ evaluated at the values $a$ and $l$ is represented by $f_{A|L}[a \mid l]$, or
simply as $f[a \mid l]$. For discrete variables $A$ and $L$, $f[a \mid l]$ is the
conditional probability $\Pr[A = a \mid L = l]$. In a conditionally randomized
experiment, $f[a \mid l]$ is positive for all $l$ such that $\Pr[L = l]$ is nonzero.

Since the denominator of the weight for each individual is the conditional
density evaluated at the individual's own values of $A$ and $L$, it can be
expressed as the conditional density evaluated at the random arguments $A$ and $L$
(as opposed to the fixed arguments $a$ and $l$), that is, as $f[A \mid L]$. This
notation, which appeared in Figure 2.3, is used to define the IP weights
$W^A = 1/f[A \mid L]$. It is needed to have a unified notation for the weights
because $\Pr[A = A \mid L = L]$ is not considered proper notation.

IP weighting yielded the same result as standardization—causal risk ratio 
equal to 1— in our example above. This is no coincidence: standardization and 
IP weighting are mathematically equivalent (see Technical Point 2.3). In fact, 
both standardization and IP weighting can be viewed as procedures to build 
a new tree in which all individuals receive treatment a. Each method uses a 
different set of probabilities to build the counterfactual tree: IP weighting
uses the conditional probability of treatment A given the covariate L (as shown 
in Figure 2.1), standardization uses the probability of the covariate L and the 
conditional probability of outcome Y given A and L. 
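
A minimal sketch of IP weighting on the same data follows. It expands the cell counts of Table 2.2 into one record per individual, assigns each the weight $W^A = 1/f[A \mid L]$, and compares weighted risks. The variable and function names are hypothetical, and the treatment probabilities are estimated from the observed frequencies, which here coincide with the design probabilities 0.5 and 0.75.

```python
from fractions import Fraction

# (L, A) -> (number of individuals, number of deaths), read off Table 2.2.
cells = {(0, 0): (4, 1), (0, 1): (4, 1), (1, 0): (3, 2), (1, 1): (9, 6)}

# Expand the cell counts into one (L, A, Y) record per individual.
people = []
for (l, a), (n, deaths) in cells.items():
    people += [(l, a, 1)] * deaths + [(l, a, 0)] * (n - deaths)

def f(a, l):
    """Conditional probability of treatment value a given L = l."""
    n_l = sum(1 for (li, _, _) in people if li == l)
    n_la = sum(1 for (li, ai, _) in people if li == l and ai == a)
    return Fraction(n_la, n_l)

# IP weight for each individual: inverse probability of the treatment received.
weights = [1 / f(a, l) for (l, a, _) in people]

def ip_weighted_risk(a):
    """Risk among those with A = a in the pseudo-population."""
    num = sum(w * y for w, (_, ai, y) in zip(weights, people) if ai == a)
    den = sum(w for w, (_, ai, _) in zip(weights, people) if ai == a)
    return num / den

pseudo_population_size = sum(weights)
causal_rr = ip_weighted_risk(1) / ip_weighted_risk(0)
print(pseudo_population_size, ip_weighted_risk(1), ip_weighted_risk(0), causal_rr)
```

The pseudo-population has twice the size of the original population, and the weighted risk ratio agrees with the value obtained by standardization.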

Because both standardization and IP weighting simulate what would have 
been observed if the variable (or variables in the vector) L had not been used 
to decide the probability of treatment, we often say that these methods adjust 
for L. In a slight abuse of language we sometimes say that these methods 
control for L, but this “analytic control” is quite different from the “physical 
control” in a randomized experiment. Standardization and IP weighting can 
be generalized to conditionally randomized studies with continuous outcomes 
(see Technical Point 2.3). 

Why not finish this book here? We have a study design (an ideal random- 
ized experiment) that, when combined with the appropriate analytic method 
(standardization or IP weighting), allows us to compute average causal effects. 
Unfortunately, randomized experiments are often unethical, impractical, or un- 
timely. For example, it is questionable that an ethical committee would have 
approved our heart transplant study. Hearts are in short supply and society 
favors assigning them to individuals who are more likely to benefit from the 
transplant, rather than assigning them randomly among potential recipients. 
Also one could question the feasibility of the study even if ethical issues were 
ignored: double-blind assignment is impossible, individuals assigned to medical 
treatment may not resign themselves to forego a transplant, and there may not 
be compatible hearts for those assigned to transplant. Even if the study were 
feasible, it would still take several years to complete it, and decisions must be 
made in the interim. Frequently, conducting an observational study is the least 
bad option. 
Technical Point 2.3


Equivalence of IP weighting and standardization. Assume that $A$ is discrete
with finite number of values and that $f[a \mid l]$ is positive for all $l$ such that
$\Pr[L = l]$ is nonzero. This positivity condition is guaranteed to hold in
conditionally randomized experiments. Under positivity, the standardized mean
for treatment level $a$ is defined as $\sum_l E[Y \mid A = a, L = l] \Pr[L = l]$ and the
IP weighted mean of $Y$ for treatment level $a$ is defined as
$E\left[\frac{I(A = a) Y}{f[A \mid L]}\right]$, i.e., the mean of $Y$, reweighted by the
IP weight $W^A = 1/f[A \mid L]$, in individuals with treatment value $A = a$. The
indicator function $I(A = a)$ is the function that takes value 1 for individuals
with $A = a$, and 0 for the others.

We now prove the equality of the IP weighted mean and the standardized mean
under positivity. By definition of an expectation,

$E\left[\frac{I(A = a) Y}{f[A \mid L]}\right] = \sum_l \frac{1}{f[a \mid l]} \left\{ E[Y \mid A = a, L = l] \, f[a \mid l] \, \Pr[L = l] \right\} = \sum_l \left\{ E[Y \mid A = a, L = l] \Pr[L = l] \right\}$

where in the final step we cancelled $f[a \mid l]$ from the numerator and
denominator, and in the first step we did not need to sum over the possible
values of $A$ because for any $a'$ other than $a$ the quantity $I(a' = a)$ is zero.
The proof treats $A$ and $L$ as discrete but not necessarily dichotomous. For
continuous $L$ simply replace the sum over $L$ with an integral.

The proof makes no reference to counterfactuals or to causality. However if
we further assume conditional exchangeability then both the IP weighted and
the standardized means are equal to the counterfactual mean $E[Y^a]$. Here we
provide two different proofs of this last statement. First, we prove equality
of $E[Y^a]$ and the standardized mean as in the text

$E[Y^a] = \sum_l E[Y^a \mid L = l] \Pr[L = l] = \sum_l E[Y^a \mid A = a, L = l] \Pr[L = l] = \sum_l E[Y \mid A = a, L = l] \Pr[L = l]$

where the second equality is by conditional exchangeability and positivity,
and the third by consistency. Second, we prove equality of $E[Y^a]$ and the IP
weighted mean as follows:

$E\left[\frac{I(A = a)}{f[A \mid L]} Y\right]$ is equal to
$E\left[\frac{I(A = a)}{f[A \mid L]} Y^a\right]$ by consistency. Next, because
positivity implies $f[a \mid L]$ is never 0, we have

$E\left[\frac{I(A = a)}{f[A \mid L]} Y^a\right] = E\left\{ E\left[ \left. \frac{I(A = a)}{f[a \mid L]} Y^a \right| L \right] \right\} = E\left\{ E\left[ \left. \frac{I(A = a)}{f[a \mid L]} \right| L \right] E[Y^a \mid L] \right\}$ (by conditional exchangeability)

$= E\{E[Y^a \mid L]\}$ (because $E\left[ \left. \frac{I(A = a)}{f[a \mid L]} \right| L \right] = 1$)

$= E[Y^a]$

When treatment is continuous, which is an unlikely design choice in
conditionally randomized experiments, $E[I(A = a) Y / f(A \mid L)]$ is no longer
equal to $\sum_l E[Y \mid A = a, L = l] \Pr[L = l]$ and thus is biased for $E[Y^a]$ even
under exchangeability. To see this, one can calculate that
$E[I(A = a) / f(a \mid l) \mid L = l]$ is equal to 0 rather than 1 if we take
$f(a \mid l)$ to be (a version of) the conditional density of $A$ given $L = l$ (with
respect to Lebesgue measure). On the other hand, if we continue to take
$f(a \mid l)$ to be $\Pr(A = a \mid L = l)$, the denominator $f(a \mid L = l)$ is zero on a set
with probability 1 so positivity fails. In Section 12.4 we discuss how IP
weighting can be generalized to accommodate continuous treatments. In
Technical Point 3.1, we discuss that the results above do not hold in the
absence of positivity, even for discrete A.
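
The algebraic equivalence can also be checked numerically. The sketch below uses a made-up discrete distribution with a three-valued treatment; every number in it is hypothetical and chosen only so that positivity holds. It evaluates both definitions exactly with rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical ingredients of a joint distribution (not from the text):
# Pr[L=l], f[a|l] = Pr[A=a | L=l], and E[Y | A=a, L=l] keyed by (a, l).
p_l = {0: F(2, 5), 1: F(3, 5)}
f_a_given_l = {0: {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)},
               1: {0: F(1, 10), 1: F(3, 10), 2: F(3, 5)}}
mean_y = {(0, 0): F(1), (1, 0): F(3), (2, 0): F(2),
          (0, 1): F(4), (1, 1): F(0), (2, 1): F(5)}

def standardized_mean(a):
    """sum_l E[Y | A=a, L=l] Pr[L=l]."""
    return sum(mean_y[(a, l)] * p_l[l] for l in p_l)

def ip_weighted_mean(a):
    """E[ I(A=a) Y / f[A|L] ], summing over the joint distribution of (A, L)."""
    total = F(0)
    for l in p_l:
        for a_prime, f_al in f_a_given_l[l].items():
            indicator = 1 if a_prime == a else 0
            cell_prob = f_al * p_l[l]          # Pr[A=a', L=l]
            total += indicator * mean_y[(a_prime, l)] * (1 / f_al) * cell_prob
    return total

for a in (0, 1, 2):
    print(a, standardized_mean(a), ip_weighted_mean(a))
```

As in the proof, the agreement relies on every f[a|l] being positive; setting one of them to zero would make the weight undefined.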





Chapter 3 
OBSERVATIONAL STUDIES 


Consider again the causal question “does one’s looking up at the sky make other pedestrians look up too?” After 
considering a randomized experiment as in the previous chapter, you concluded that looking up so many times was 
too time-consuming and unhealthy for your neck bones. Hence you decided to conduct the following study: Find 
a nearby pedestrian who is standing in a corner and not looking up. Then find a second pedestrian who is walking 
towards the first one and not looking up either. Observe and record their behavior during the next 10 seconds. 
Repeat this process a few thousand times. You could now compare the proportion of second pedestrians who
looked up after the first pedestrian did with the proportion of second pedestrians who looked up
before the first pedestrian did. Such a scientific study in which the investigator observes and records the relevant
data is referred to as an observational study. 

If you had conducted the observational study described above, critics could argue that two pedestrians may 
both look up not because the first pedestrian’s looking up causes the other’s looking up, but because they both 
heard a thunderous noise above or some rain drops started to fall, and thus your study findings are inconclusive 
as to whether one’s looking up makes others look up. These criticisms do not apply to randomized experiments, 
which is one of the reasons why randomized experiments are central to the theory of causal inference. However, 
in practice, the importance of randomized experiments for the estimation of causal effects is more limited. Many 
scientific studies are not experiments. Much human knowledge is derived from observational studies. Think of 
evolution, tectonic plates, global warming, or astrophysics. Think of how humans learned that hot coffee may cause 
burns. This chapter reviews some conditions under which observational studies lead to valid causal inferences. 


3.1 Identifiability conditions 


Ideal randomized experiments can be used to identify and quantify average
causal effects because the randomized assignment of treatment leads to
exchangeability. Take a marginally randomized experiment of heart transplant
and mortality as an example: if those who received a transplant had not
received it, they would have been expected to have the same death risk as those
who did not actually receive the heart transplant. As a consequence, an
associational risk ratio of 0.7 from the randomized experiment is expected to
equal the causal risk ratio.

For simplicity, this chapter considers only randomized experiments in which
all participants remain under follow-up and adhere to their assigned treatment
throughout the entire study. Chapters 8 and 9 discuss alternative scenarios.

Observational studies, on the other hand, may be much less convincing (for
an example, see the introduction to this chapter). A key reason for our
hesitation to endow observational associations with a causal interpretation is
the lack of randomized treatment assignment. As an example, take an
observational study of heart transplant and mortality in which those who
received the heart transplant were more likely to have a severe heart
condition. Then, if those who received a transplant had not received it, they
would have been expected to have a greater death risk than those who did not
actually receive the heart transplant. As a consequence, an associational risk
ratio of 1.1 from the observational study would be a compromise between the
truly beneficial effect of transplant on mortality (which pushes the
associational risk ratio to be under 1) and the underlying greater mortality
risk in those who received transplant (which pushes the associational risk
ratio to be over 1). The best explanation for an association between treatment
and outcome in an observational study is not necessarily a causal effect of
the treatment on the outcome.

Rubin (1974, 1978) extended Neyman's theory for randomized experiments to
observational studies. Rosenbaum and Rubin (1983) referred to the combination
of exchangeability and positivity as weak ignorability, and to the combination
of full exchangeability (see Technical Point 2.1) and positivity as strong
ignorability.

Table 3.1

              L  A  Y
Rheia         0  0  0
Kronos        0  0  1
Demeter       0  0  0
Hades         0  0  0
Hestia        0  1  0
Poseidon      0  1  0
Hera          0  1  0
Zeus          0  1  1
Artemis       1  0  1
Apollo        1  0  1
Leto          1  0  0
Ares          1  1  1
Athena        1  1  1
Hephaestus    1  1  1
Aphrodite     1  1  1
Cyclope       1  1  1
Persephone    1  1  1
Hermes        1  1  0
Hebe          1  1  0
Dionysus      1  1  0

While recognizing that randomized experiments have intrinsic advantages 
for causal inference, sometimes we are stuck with observational studies to an- 
swer causal questions. What do we do? We analyze our data as if treatment 
had been randomly assigned conditional on measured covariates L—though we 
often know this is at best an approximation. Causal inference from observa- 
tional data then revolves around the hope that the observational study can be 
viewed as a conditionally randomized experiment. 

Informally, an observational study can be conceptualized as a conditionally 
randomized experiment if the following conditions hold: 


1. the values of treatment under comparison correspond to well-defined in- 
terventions that, in turn, correspond to the versions of treatment in the 
data 


2. the conditional probability of receiving every value of treatment, though 
not decided by the investigators, depends only on measured covariates L 


3. the probability of receiving every value of treatment conditional on L is 
greater than zero, i.e., positive 


In this chapter we describe these three conditions in the context of ob- 
servational studies. Condition 1 was referred to as consistency in Chapter 1, 
condition 2 was referred to as exchangeability in the previous chapters, and 
condition 3 was referred to as positivity in Technical Point 2.3. 
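
Of the three conditions, positivity is the one that can be inspected most directly in the data, at least for the strata of L that were measured. The following sketch uses hypothetical names and our own encoding of the cell counts in Table 3.1; it lists any combination of L and treatment value that never occurs. An empty list is compatible with positivity but does not establish it, because the condition concerns probabilities in the population rather than sample frequencies.

```python
# (L, A) -> number of individuals, read off Table 3.1.
n = {(0, 0): 4, (0, 1): 4, (1, 0): 3, (1, 1): 9}

def positivity_violations(counts, treatment_values=(0, 1)):
    """Return (L, A) combinations with no individuals, among observed strata of L."""
    strata = {l for (l, _) in counts}
    return [(l, a) for l in sorted(strata) for a in treatment_values
            if counts.get((l, a), 0) == 0]

print(positivity_violations(n))

# Hypothetical variant in which nobody in critical condition is untreated.
no_untreated_critical = {(0, 0): 4, (0, 1): 4, (1, 1): 12}
print(positivity_violations(no_untreated_critical))
```

No comparable check is offered here for exchangeability, which is treated in this chapter as an assumption rather than something read off the observed data.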

We will see that these conditions are often heroic, which explains why causal 
inferences from observational studies are viewed with suspicion. However, if 
the analogy between observational study and conditionally randomized exper- 
iment happens to be correct, then we can use the methods described in the 
previous chapter—IP weighting or standardization—to identify causal effects 
from observational studies. We therefore refer to these conditions as identifi- 
ability conditions or assumptions. For example, in the previous chapter, we 
computed a causal risk ratio equal to 1 using the data in Table 2.2, which arose 
from a conditionally randomized experiment. If the same data, now shown in 
Table 3.1, had arisen from an observational study and the three identifiability 
conditions above held true, we would also compute a causal risk ratio equal to 
1. 

Importantly, in ideal randomized experiments the identifiability conditions 
hold by design. That is, for a conditionally randomized experiment, we would 
only need the data in Table 3.1 to compute the causal risk ratio of 1. In 
contrast, to identify the causal risk ratio from an observational study, we would 
need to assume that the identifiability conditions held, which of course may not 
be true. Causal inference from observational data requires two elements: data 
and identifiability conditions. See Fine Point 3.1 for a more precise definition 
of identifiability. 

When any of the identifiability conditions does not hold, the analogy be- 
tween observational study and conditionally randomized experiment breaks 
down. In that situation, there are other possible approaches to causal inference 
from observational data, which require a different set of identifiability condi- 
tions. One of these approaches is hoping that a predictor of treatment, referred 
to as an instrumental variable, behaves as if it had been randomly assigned con- 
ditional on the measured covariates. We discuss instrumental variable methods 
in Chapter 16. 
Fine Point 3.1


Identifiability of causal effects. We say that an average causal effect is
(nonparametrically) identifiable under a particular set of assumptions if these
assumptions imply that the distribution of the observed data is compatible with
a single value of the effect measure. Conversely, we say that an average causal
effect is nonidentifiable under the assumptions when the distribution of the
observed data is compatible with several values of the effect measure. For
example, if the study in Table 3.1 had arisen from a conditionally randomized
experiment in which the probability of receiving treatment depended on the
value of $L$ (and hence conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds by
design) then we showed in the previous chapter that the causal effect is
identifiable: the causal risk ratio equals 1, without requiring any further
assumptions. However, if the data in Table 3.1 had arisen from an observational
study, then the causal risk ratio equals 1 only if we supplement the data with
the assumption of conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$. To identify
the causal effect in observational studies, we need an assumption external to
the data, an identifying assumption. In fact, if we decide not to supplement
the data with the identifying assumption, then the data in Table 3.1 are
consistent with a causal risk ratio

• lower than 1, if risk factors other than $L$ are more frequent among the treated.
• greater than 1, if risk factors other than $L$ are more frequent among the untreated.
• equal to 1, if all risk factors except $L$ are equally distributed between the treated and the untreated or, equivalently,
  if $Y^a \perp\!\!\!\perp A \mid L$.

This chapter discusses the three identifiability conditions for nonparametric identification of average causal effects.
In Chapter 16, we describe alternative identifiability conditions which suffice for nonparametric identification of average
causal effects.


Not surprisingly, observational methods based on the analogy with a con- 
ditionally randomized experiment have been traditionally privileged in disci- 
plines in which this analogy is often reasonable (e.g., epidemiology), whereas 
instrumental variable methods have been traditionally privileged in disciplines 
in which observational studies cannot often be conceptualized as condition- 
ally randomized experiments given the measured covariates (e.g., economics). 
Until Chapter 16, we will focus on causal inference approaches that rely on 
the ability of the observational study to emulate a conditionally randomized 
experiment. We now describe in more detail each of the three identifiability 
conditions. 


3.2 Exchangeability 


We have already said much about exchangeability $Y^a \perp\!\!\!\perp A$. In marginally (i.e.,
unconditionally) randomized experiments, the treated and the untreated are
exchangeable because the treated, had they remained untreated, would have
experienced the same average outcome as the untreated did, and vice versa.
This is so because randomization ensures that the independent predictors of
the outcome are equally distributed between the treated and the untreated
groups.

An independent predictor of the outcome is a covariate associated
with the outcome Y within levels of treatment. For dichotomous outcomes,
independent predictors of the outcome are often referred to
as risk factors for the outcome.

For example, take the study summarized in Table 3.1. We said in the previous
chapter that exchangeability clearly does not hold in this study because
69% of the treated versus 43% of the untreated individuals were in critical
condition L = 1 at baseline. This imbalance in the distribution of an independent
outcome predictor is not expected to occur in a marginally randomized experiment
(actually, such imbalance might occur by chance but let us keep working under
the illusion that our study is large enough to prevent chance findings).

Fine Point 3.2 introduces the relation between lack of exchangeability
and confounding.

Observational studies

On the other hand, an imbalance in the distribution of independent out- 
come predictors L between the treated and the untreated is expected by design 
in conditionally randomized experiments in which the probability of receiving 
treatment depends on L. The study in Table 3.1 is such a conditionally random- 
ized experiment: the treated and the untreated are not exchangeable—because 
the treated had, on average, a worse prognosis at the start of the study—but 
the treated and the untreated are conditionally exchangeable within levels of 
the variable L. In the subset L = 1 (critical condition), the treated and the 
untreated are exchangeable because the treated, had they remained untreated, 
would have experienced the same average outcome as the untreated did, and 
vice versa. And similarly for the subset L = 0. An equivalent statement:
conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds in conditionally randomized
experiments because, within levels of L, all other predictors of the outcome are
equally distributed between the treated and the untreated groups.
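
The imbalance quoted above (69% of the treated versus 43% of the untreated in critical condition) is a pair of conditional proportions. A minimal sketch of the arithmetic, assuming cell counts for a 20-person study chosen to agree with the probabilities quoted in this chapter; the counts themselves are an assumption, since Table 3.1 is not reproduced in this section:

```python
# Assumed cell counts n[(l, a)]: number of individuals with L = l and A = a.
# Illustrative only; chosen to agree with the percentages quoted in the text.
n = {(0, 0): 4, (0, 1): 4, (1, 0): 3, (1, 1): 9}


def prop_critical(a):
    """Proportion with L = 1 among individuals who received treatment level a."""
    return n[(1, a)] / (n[(0, a)] + n[(1, a)])


treated = prop_critical(1)    # 9 / 13
untreated = prop_critical(0)  # 3 / 7
print(f"Pr[L = 1 | A = 1] = {treated:.2f}")
print(f"Pr[L = 1 | A = 0] = {untreated:.2f}")
```

With these assumed counts, the same table also gives Pr[A = 1|L = 1] = 9/12 and Pr[A = 1|L = 0] = 4/8, so the imbalance in L between treatment groups is exactly what treatment assignment that depends on L would produce.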

Back to observational studies. When treatment is not randomly assigned 
by the investigators, the reasons for receiving treatment are likely to be associ- 
ated with some outcome predictors. That is, like in a conditionally randomized 
experiment, the distribution of outcome predictors will generally vary between 
the treated and untreated groups in an observational study. For example, the 
data in Table 3.1 could have arisen from an observational study in which doc- 
tors tend to direct the scarce heart transplants to those who need them most, 
i.e., individuals in critical condition L = 1. In fact, if the only outcome pre- 
dictor that is unequally distributed between the treated and the untreated is 
L, then one can refer to the study in Table 3.1 as either (i) an observational 
study in which the probability of treatment A = 1 is 0.75 among those with 
L = 1 and 0.50 among those with L = 0, or (ii) a (nonblinded) conditionally
randomized experiment in which investigators randomly assigned treatment
A = 1 with probability 0.75 to those with L = 1 and 0.50 to those with L = 0.
Both characterizations of the study are logically equivalent. Under either
characterization, conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds and standardization
or IP weighting can be used to identify the causal effect.
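
As a rough illustration of how either method gives the same answer under conditional exchangeability, the sketch below computes the standardized and the IP weighted risks from a table of counts. The counts are hypothetical, chosen so that Pr[A = 1|L = 1] = 0.75 and Pr[A = 1|L = 0] = 0.50 as stated above; the helper names (`n`, `standardized_risk`, `ip_weighted_risk`) are introduced here for illustration only.

```python
from fractions import Fraction

# Hypothetical counts: (L, A, Y) -> number of individuals. Illustrative only.
counts = {
    (0, 0, 0): 3, (0, 0, 1): 1,
    (0, 1, 0): 3, (0, 1, 1): 1,
    (1, 0, 0): 1, (1, 0, 1): 2,
    (1, 1, 0): 3, (1, 1, 1): 6,
}
N = sum(counts.values())


def n(l=None, a=None, y=None):
    """Count individuals matching the given values (None means 'any')."""
    return sum(
        c for (li, ai, yi), c in counts.items()
        if (l is None or li == l)
        and (a is None or ai == a)
        and (y is None or yi == y)
    )


def standardized_risk(a):
    """sum_l Pr[Y = 1 | A = a, L = l] * Pr[L = l]."""
    return sum(Fraction(n(l, a, 1), n(l, a)) * Fraction(n(l), N) for l in (0, 1))


def ip_weighted_risk(a):
    """Mean of I(A = a) * Y / Pr[A = a | L] over all N individuals."""
    total = Fraction(0)
    for l in (0, 1):
        pr_a_given_l = Fraction(n(l, a), n(l))
        # Each individual with L = l, A = a, Y = 1 contributes 1 / Pr[A = a | L = l].
        total += Fraction(n(l, a, 1)) / pr_a_given_l
    return total / N


print(standardized_risk(1), standardized_risk(0))
print(ip_weighted_risk(1), ip_weighted_risk(0))
print(standardized_risk(1) / standardized_risk(0))
```

With these counts both methods return a risk of 1/2 under each treatment level, so the causal risk ratio is 1, provided that conditional exchangeability given L is assumed.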

Of course, the crucial question for the observational study is whether L is 
the only outcome predictor that is unequally distributed between the treated 
and the untreated. Sadly, the question must remain unanswered. For example, 
suppose the investigators of our observational study strongly believe that the 
treated and the untreated are exchangeable within levels of L. Their reasoning 
goes as follows: “Heart transplants are assigned to individuals with low proba- 
bility of rejecting the transplant, that is, a heart with certain human leukocyte 
antigen (HLA) genes will be assigned to an individual who happens to have
compatible genes. Because HLA genes are not predictors of mortality, it turns
out that treatment assignment is essentially random within levels of L.” Thus
our investigators are willing to work under the assumption that conditional
exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds.

The key word is “assumption.” No matter how convincing the investigators’
story may be, in the absence of randomization, there is no guarantee that
conditional exchangeability holds. For example, suppose that, unknown to the
investigators, doctors prefer to transplant hearts into nonsmokers. If two
individuals with L = 1 have similar HLA genes, but one of them is a smoker
(U = 1) and the other one is a nonsmoker (U = 0), the one with U = 1 has
a lower probability of receiving treatment A = 1. When the distribution of
smoking, an important outcome predictor, differs between the treated (with
lower proportion of smokers U = 1) and the untreated (with higher proportion
of smokers) in the stratum L = 1, conditional exchangeability given L does not
hold. Importantly, collecting data on smoking would not prevent the possibility
that other imbalanced outcome predictors, unknown to the investigators,
remain unmeasured.

We use U to denote unmeasured variables. Because unmeasured
variables cannot be used for standardization or IP weighting, the
causal effect cannot be identified when the measured variables L are
insufficient to achieve conditional exchangeability.

To verify conditional exchangeability, one needs to confirm
that $\Pr[Y^a = 1 \mid A = a, L = l] = \Pr[Y^a = 1 \mid A \neq a, L = l]$. But this
is logically impossible because, for individuals who do not receive
treatment a ($A \neq a$), the value of $Y^a$ is unknown and so the right
hand side cannot be empirically evaluated.

Fine Point 3.2


Crossover randomized experiments. In Fine Point 2.1, we described crossover experiments in which an individual
is observed during two or more periods (say $t = 0$ and $t = 1$) and the individual receives a different treatment value
in each period. We showed that individual causal effects can be identified in crossover experiments when the following
three strong conditions hold: i) no carryover effect of treatment: $Y_{it=1}^{a_0, a_1} = Y_{it=1}^{a_1}$, ii) the individual causal effect does
not depend on time: $Y_{it}^{a_t=1} - Y_{it}^{a_t=0} = \alpha_i$ for $t = 0, 1$, and iii) the counterfactual outcome under no treatment does
not depend on time: $Y_{it}^{a_t=0} = \beta_i$ for $t = 0, 1$. No randomization was required. We now turn our attention to crossover
randomized experiments in which the order of treatment values that an individual receives is randomly assigned.

Randomized treatment assignment becomes important when, due to possible temporal effects, we do not assume
iii) holds. For simplicity, assume that every individual is randomized to either $(A_{i1} = 1, A_{i0} = 0)$ or $(A_{i1} = 0, A_{i0} = 1)$
with probability 0.5. Let $Y_{i1}^{a_1=0} - Y_{i0}^{a_0=0} = r_i$. Then, under i) and ii) and consistency, if $A_{i1} = 1$ and $A_{i0} = 0$,
then $Y_{i1} - Y_{i0} = \alpha_i + r_i$, and if $A_{i1} = 0$ and $A_{i0} = 1$, then $Y_{i0} - Y_{i1} = \alpha_i - r_i$. Because $r_i$ is unknown we can
no longer identify individual causal effects but, since $A_{i1}$ and $A_{i0}$ are randomized and therefore independent of $r_i$, the
mean of $(Y_{i1} - Y_{i0}) A_{i1} + (Y_{i0} - Y_{i1}) A_{i0}$ estimates the average causal effect, i.e., $E[\alpha_i]$. If we only assume i), then
this mean estimates the average of the average treatment effects at times 0 and 1, i.e., $(E[\alpha_{i1}] + E[\alpha_{i0}])/2$, where

$$\alpha_{it} = Y_{it}^{a_t=1} - Y_{it}^{a_t=0}.$$

In conclusion, if assumption i) of no carryover effect holds, then a crossover experiment can be used to estimate
average causal effects. However, for the type of treatments and outcomes we study in this book, the assumption of no
carryover effect is implausible.
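
The crossover argument of Fine Point 3.2 can be checked numerically. The sketch below assumes no carryover effect and a time-constant individual effect (conditions i and ii) and uses four hypothetical individuals whose values of alpha_i, r_i, and baseline outcome are invented for illustration. It averages the contrast over the two equally likely treatment orders rather than drawing random assignments, so the result is exact.

```python
from fractions import Fraction

# Hypothetical individuals: (alpha_i, r_i, y00_i), where
#   alpha_i = individual causal effect, the same at t = 0 and t = 1 (condition ii),
#   r_i     = Y_{i1}^{a_1=0} - Y_{i0}^{a_0=0}, the temporal effect under no treatment,
#   y00_i   = counterfactual outcome under no treatment at t = 0.
individuals = [(2, 5, 10), (-1, -3, 7), (4, 0, 12), (0, 8, 9)]


def observed(alpha, r, y00, a0, a1):
    """Observed (Y_i0, Y_i1) under no carryover (condition i) and consistency."""
    y0 = y00 + alpha * a0
    y1 = y00 + r + alpha * a1
    return y0, y1


def contrast(alpha, r, y00, a0, a1):
    """(Y_i1 - Y_i0) * A_i1 + (Y_i0 - Y_i1) * A_i0 for one treatment order."""
    y0, y1 = observed(alpha, r, y00, a0, a1)
    return (y1 - y0) * a1 + (y0 - y1) * a0


# Expectation over the randomization: each treatment order has probability 1/2.
expected_contrast = [
    Fraction(1, 2) * contrast(al, r, y, 0, 1)
    + Fraction(1, 2) * contrast(al, r, y, 1, 0)
    for al, r, y in individuals
]
mean_contrast = sum(expected_contrast) / len(individuals)
mean_alpha = Fraction(sum(al for al, _, _ in individuals), len(individuals))
print(mean_contrast, mean_alpha)
```

For any single individual the realized contrast is alpha_i + r_i or alpha_i - r_i, so the individual effect is not recovered; only the average over the randomization equals alpha_i, which is why the mean across individuals targets E[alpha_i].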


Thus exchangeability $Y^a \perp\!\!\!\perp A \mid L$ may not hold in observational studies.
Specifically, conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ will not hold if there exist
unmeasured independent predictors U of the outcome such that the probability of
receiving treatment A depends on U within strata of L. Worse yet, even if
conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ held, the investigators cannot
empirically verify that this is actually the case. How can they check that the distribution
of smoking is equal in the treated and the untreated if they have not collected 
data on smoking? What about all the other unmeasured outcome predictors 
U that may also be differentially distributed between the treated and the un- 
treated? When analyzing an observational study under conditional exchange- 
ability, we must hope that our expert knowledge guides us correctly to collect 
enough data so that the assumption is at least approximately true. 
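
To see how an unmeasured predictor such as smoking can mislead, consider an exact calculation within the stratum L = 1. All probabilities below are invented for illustration. In this toy model treatment has no effect on the outcome, so the causal risk ratio is 1 by construction, yet the risk ratio computed without U differs from 1.

```python
from fractions import Fraction as F

# Hypothetical probabilities within the stratum L = 1 (all values are assumptions).
pr_u = {0: F(1, 2), 1: F(1, 2)}             # Pr[U = u]; U = 1 denotes smokers
pr_a1_given_u = {0: F(9, 10), 1: F(6, 10)}  # doctors prefer to treat nonsmokers
pr_y1_given_u = {0: F(1, 2), 1: F(4, 5)}    # risk depends on U but not on A


def pr_a_given_u(a, u):
    """Pr[A = a | U = u]."""
    return pr_a1_given_u[u] if a == 1 else 1 - pr_a1_given_u[u]


def observed_risk(a):
    """Pr[Y = 1 | A = a], which is all the data can show when U is unmeasured."""
    num = sum(pr_u[u] * pr_a_given_u(a, u) * pr_y1_given_u[u] for u in (0, 1))
    den = sum(pr_u[u] * pr_a_given_u(a, u) for u in (0, 1))
    return num / den


def risk_standardized_by_u(a):
    """sum_u Pr[Y = 1 | A = a, U = u] Pr[U = u]; computable only if U were measured.

    In this toy model Pr[Y = 1 | A = a, U = u] does not depend on a.
    """
    return sum(pr_y1_given_u[u] * pr_u[u] for u in (0, 1))


associational_rr = observed_risk(1) / observed_risk(0)
adjusted_rr = risk_standardized_by_u(1) / risk_standardized_by_u(0)
print(associational_rr, adjusted_rr)
```

Here the treated include fewer smokers, so treatment appears protective (risk ratio 31/37) even though it does nothing. Standardizing by U would remove the bias, but that option is unavailable when U has not been collected.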


Investigators can use their expert knowledge to enhance the plausibility 
of the conditional exchangeability assumption. They can measure many rele- 
vant variables L (e.g., determinants of the treatment that are also independent 
outcome predictors), rather than only one variable as in Table 3.1, and then as- 
sume that conditional exchangeability is approximately true within the strata 
defined by the combination of all those variables L. Unfortunately, no mat- 
ter how many variables are included in L, there is no way to test that the 
assumption is correct, which makes causal inference from observational data 
a risky task. The validity of causal inferences requires that the investigators’
expert knowledge is correct. This knowledge, encoded as the assumption of
exchangeability conditional on the measured covariates, supplements the data
in an attempt to identify the causal effect of interest.


3.3 Positivity


The positivity condition is sometimes referred to as the experimental
treatment assumption.

Positivity: $\Pr[A = a \mid L = l] > 0$ for all values $l$ with $\Pr[L = l] \neq 0$
in the population of interest.


Some investigators plan to conduct an experiment to compute the average 
effect of heart transplant A on 5-year mortality Y. It goes without saying that 
the investigators will assign some individuals to receive treatment level A = 1 
and others to receive treatment level A = 0. Consider the alternative: the 
investigators assign all individuals to either A = 1 or A = 0. That would be 
silly. With all the individuals receiving the same treatment level, computing the 
average causal effect would be impossible. Instead we must assign treatment 
so that, with near certainty, some individuals will be assigned to each of the 
treatment groups. In other words, we must ensure that there is a probability 
greater than zero—a positive probability—of being assigned to each of the 
treatment levels. This is the positivity condition. 

We did not emphasize positivity when describing experiments because pos- 
itivity is taken for granted in those studies. In marginally randomized ex- 
periments, the probabilities Pr [A = 1] and Pr[A = 0] are both positive by 
design. In conditionally randomized experiments, the conditional probabili- 
ties Pr[A = 1|L = l] and Pr[A = 0|L = l] are also positive by design for all 
levels of the variable L that are eligible for the study. For example, if the 
data in Table 3.1 had arisen from a conditionally randomized experiment, the 
conditional probabilities of assignment to heart transplant would have been 
Pr [A = 1|L = 1] = 0.75 for those in critical condition and Pr [A = 1|L = 0] =
0.50 for the others. Positivity holds, conditional on L, because neither of
these probabilities is 0 (nor 1, which would imply that the probability of no
heart transplant A = 0 would be 0). Thus we say that there is positivity if
Pr [A = a|L = l] > 0 for all a involved in the causal contrast. Actually, this
definition of positivity is incomplete because, if our study population were re- 
stricted to the group L = 1, then there would be no need to require positivity 
in the group L = 0. Positivity is only needed for the values l that are present 
in the population of interest. 

In addition, positivity is only required for the variables L that are required 
for exchangeability. For example, in the conditionally randomized experiment 
of Table 3.1, we do not ask ourselves whether the probability of receiving 
treatment is greater than 0 in individuals with blue eyes because the variable 
“having blue eyes” is not necessary to achieve exchangeability between the 
treated and the untreated. (The variable “having blue eyes” is not an inde- 
pendent predictor of the outcome Y conditional on L and A, and was not even 
used to assign treatment.) That is, the standardized risk and the IP weighted 
risk are equal to the counterfactual risk after adjusting for L only; positivity 
does not apply to variables that, like “having blue eyes”, do not need to be 
adjusted for. 

In observational studies, neither positivity nor exchangeability is guaranteed.
For example, positivity would not hold if doctors always transplant a
heart to individuals in critical condition L = 1, i.e., if Pr[A = 0|L = 1] = 0, 
as shown in Figure 3.1. A difference between the conditions of exchangeabil- 
ity and positivity is that positivity can sometimes be empirically verified (see 
Chapter 12). For example, if Table 3.1 corresponded to data from an observational
study, we would conclude that positivity holds for L because there are
people at all levels of treatment (i.e., A = 0 and A = 1) in every level of L
(i.e., L = 0 and L = 1).

Figure 3.1

Our discussion of standardization and IP weighting
in the previous chapter was explicit about the exchangeability condition, but 
only implicitly assumed the positivity condition (explicitly in Technical Point 
2.3). Our previous definitions of standardized risk and IP weighted risk are 
actually only meaningful when positivity holds. To intuitively understand why 
the standardized and IP weighted risk are not well-defined when the positiv- 
ity condition fails, consider Figure 3.1. If there were no untreated individuals 
(A = 0) with L = 1, the data would contain no information to simulate what 
would have happened had all treated individuals been untreated because there 
would be no untreated individuals with L = 1 that could be considered ex- 
changeable with the treated individuals with L = 1. See Technical Point 3.1 
for details. 
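
Because positivity involves only the observed variables A and L, a simple tabulation can flag empty cells. The function below is a sketch; its name and the example counts are assumptions, not taken from the book. It reports every combination of covariate level and treatment level that contains no individuals. In a finite sample an empty cell may also arise by chance, so such a check is suggestive rather than conclusive.

```python
def positivity_violations(counts, treatment_levels=(0, 1)):
    """Return the (l, a) pairs with no individuals, for levels l present in the data.

    counts maps (l, a) -> number of individuals with L = l and A = a.
    """
    levels_present = {l for (l, _), c in counts.items() if c > 0}
    return sorted(
        (l, a)
        for l in levels_present
        for a in treatment_levels
        if counts.get((l, a), 0) == 0
    )


# Assumed counts consistent with the treatment probabilities quoted for Table 3.1.
table_like = {(0, 0): 4, (0, 1): 4, (1, 0): 3, (1, 1): 9}
# Assumed counts for a scenario like Figure 3.1: nobody with L = 1 is untreated.
figure_like = {(0, 0): 4, (0, 1): 4, (1, 0): 0, (1, 1): 12}

print(positivity_violations(table_like))   # []
print(positivity_violations(figure_like))  # [(1, 0)]
```

In the second scenario the cell (L = 1, A = 0) is empty, so Pr[Y = 1|A = 0, L = 1] cannot be computed and neither the standardized nor the IP weighted risk under no treatment is well defined.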





3.4 Consistency: First, define the counterfactual outcome 


Robins and Greenland (2000) ar- 
gued that well-defined counterfac- 
tuals, or mathematically equivalent 
concepts, are necessary for mean- 
ingful causal inference. 


Consistency means that the observed outcome for every treated individual 
equals her outcome if she had received treatment, and that the observed out- 
come for every untreated individual equals her outcome if she had remained 
untreated, that is, $Y^a = Y$ for every individual with A = a. This statement
seems so obviously true that some readers may be wondering whether there
are any situations in which consistency does not hold. After all, if I take aspirin
A = 1 and I die (Y = 1), isn’t it the case that my outcome $Y^{a=1}$ under
aspirin also equals 1? The apparent simplicity of the consistency condition
is deceptive. Let us unpack consistency by explicitly describing its two main
components: (1) a precise definition of the counterfactual outcomes $Y^a$ via a
detailed specification of the superscript a, and (2) the linkage of the counter- 
factual outcomes to the observed outcomes. This section deals with the first 
component of consistency.

Technical Point 3.1


Positivity for standardization and IP weighting. We have defined the standardized mean for treatment level
a as $\sum_l E[Y \mid A = a, L = l] \Pr[L = l]$. However, this expression can only be computed if the conditional
quantity $E[Y \mid A = a, L = l]$ is well defined, which will be the case when the conditional probability $\Pr[A = a \mid L = l]$ is
greater than zero for all values l that occur in the population. That is, when positivity holds. (Note the statement
$\Pr[A = a \mid L = l] > 0$ for all $l$ with $\Pr[L = l] \neq 0$ is effectively equivalent to $f[a \mid L] > 0$ with probability 1.) Therefore,
the standardized mean is defined as

$$\sum_l E[Y \mid A = a, L = l] \Pr[L = l] \quad \text{if } \Pr[A = a \mid L = l] > 0 \text{ for all } l \text{ with } \Pr[L = l] \neq 0,$$

and is undefined otherwise. The standardized mean can be computed only if, for each value of the covariate L in the
population, there are some individuals that received the treatment level a.

The IP weighted mean $E\left[\frac{I(A=a)Y}{f[A \mid L]}\right]$ is no longer equal to $E\left[\frac{I(A=a)Y}{f[a \mid L]}\right]$ when positivity does not hold.
Specifically, $E\left[\frac{I(A=a)Y}{f[a \mid L]}\right]$ is undefined because the undefined ratio $\frac{0}{0}$ occurs in computing the expectation. On the
other hand, the IP weighted mean $E\left[\frac{I(A=a)Y}{f[A \mid L]}\right]$ is always well defined since its denominator $f[A \mid L]$ can never be
zero. However, it is now a biased estimate of the counterfactual mean even under exchangeability. In particular, when
positivity fails to hold, $E\left[\frac{I(A=a)Y}{f[A \mid L]}\right]$ is equal to

$$\Pr[L \in Q(a)] \sum_l E[Y \mid A = a, L = l, L \in Q(a)] \Pr[L = l \mid L \in Q(a)]$$

where $Q(a) = \{l;\ \Pr(A = a \mid L = l) > 0\}$ is the set of values $l$ for which A = a may be observed with positive probability.
Therefore, under exchangeability, $E\left[\frac{I(A=a)Y}{f[A \mid L]}\right]$ equals $E[Y^a \mid L \in Q(a)] \Pr[L \in Q(a)]$.

From the definition of $Q(a)$, $Q(0)$ cannot equal $Q(1)$ when A is binary and positivity does not hold. In this case
the contrast $E\left[\frac{I(A=1)Y}{f[A \mid L]}\right] - E\left[\frac{I(A=0)Y}{f[A \mid L]}\right]$ has no causal interpretation, even under exchangeability, because it
is a contrast between two different groups. Under positivity, $Q(1) = Q(0)$ and the contrast is the average causal effect
if exchangeability holds.
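
The expression for the IP weighted mean when positivity fails follows from iterated expectations. A sketch of the intermediate steps, assuming a discrete L and using only the quantities defined in Technical Point 3.1 (the sums run over values $l$ with $\Pr[L = l] \neq 0$):

```latex
\begin{align*}
E\left[\frac{I(A=a)\,Y}{f[A \mid L]}\right]
 &= \sum_{l} \Pr[L=l]\; E\left[\frac{I(A=a)\,Y}{f[A \mid L]} \,\middle|\, L=l\right] \\
 &= \sum_{l \in Q(a)} \Pr[L=l]\; \Pr[A=a \mid L=l]\;
    \frac{E[Y \mid A=a, L=l]}{\Pr[A=a \mid L=l]} \\
 &= \sum_{l \in Q(a)} E[Y \mid A=a, L=l]\, \Pr[L=l] \\
 &= \Pr[L \in Q(a)] \sum_{l} E[Y \mid A=a, L=l, L \in Q(a)]\,
    \Pr[L=l \mid L \in Q(a)].
\end{align*}
```

The second line uses the fact that, for $l \notin Q(a)$, the indicator $I(A=a)$ is zero with probability 1, so those strata contribute nothing. The last line rewrites $\Pr[L=l]$ as $\Pr[L \in Q(a)] \Pr[L=l \mid L \in Q(a)]$ for $l \in Q(a)$.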


Consider again a randomized experiment to compute the causal effect of
heart transplant A on 5-year mortality Y. Before enrolling patients in the
study, the investigators wrote a protocol in which the two interventions of
interest (heart transplant A = 1 and medical therapy A = 0) were described
in detail. For example, the investigators specified that individuals assigned to
heart transplant A = 1 were to receive certain pre-operative procedures,
anesthesia, surgical technique, post-operative care, and immunosuppressive
therapy. Had the protocol not specified these details, it is possible that each doctor
would have conducted a different version of the treatment “heart transplant”, perhaps
using her preferred surgical technique or immunosuppressive therapy.

Fine Point 1.2 introduced the concept of multiple versions of treatment.

A problem arises if different versions of treatment have different causal
effects. For example, the average causal effect of “heart transplant” in a study
in which most doctors used a traditional surgical technique may differ from that
in a study in which most doctors used a novel surgical technique. Therefore,
when referring to “the causal effect of heart transplant A on mortality”, we need
to specify the versions a of treatment A that are of interest. If the treatment
values a are not well defined, then the counterfactual outcomes $Y^a$ are not well
defined, which in turn means that the causal effect $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]$
is not well defined. Ideally, the protocols of randomized experiments will
precisely specify the treatment values a assigned to each individual, so that
their counterfactual outcomes $Y^a$ are well defined. In observational studies,
investigators will need to specify the values a under study as unambiguously as
possible. While this task is relatively straightforward for medical interventions,
like heart transplant, it is much harder for treatments that do not correspond
to actual interventions in the real world.

For simplicity, we consider the usual definition of obesity (body mass
index ≥ 30). More sophisticated definitions of adiposity might be desirable,
but using them would complicate the exposition without fundamentally
altering the main points.

Part III of this book is devoted to interventions that, like interventions
on obesity, are sustained over time. In this chapter we ignore the
definitions (and notation) that are required for a formal discussion of
sustained interventions.

Hernán and Taubman (2008) discuss the tribulations of two world
leaders (a despotic king and a clueless president) who tried to estimate
the effect of obesity in their own countries.

Suppose that a colleague of ours wishes to quantify the causal effect of 
obesity A at age 40 on the risk of mortality Y by age 50 in a certain population. 
Formally, the causal effect is defined by a contrast between the risk if all
individuals had been obese $\Pr[Y^{a=1} = 1]$ and the risk if all individuals had
been nonobese $\Pr[Y^{a=0} = 1]$ at age 40. But what exactly is meant by “the
risk if all individuals had been obese”? The answer is not clear because there 
are many different ways in which an individual could have become obese at 
age 40. For example, an individual might be obese at age 40 after having been 
obese for 20 years, or after having been obese for 2 years only. That is, there 
are multiple versions of the treatment A = 1 defined by duration, recency, and 
intensity of obesity. Because each of these versions may have a different effect 
on mortality, our colleague needs to provide a detailed definition of the version 
of obesity at age 40 that he is interested in. Otherwise, the “causal effect of 
obesity A at age 40 on mortality at age 50” will be ill-defined. 

But, even if our colleague were able to define the duration, recency, and 
intensity of obesity A = 1, other aspects of the intervention would also need to 
be specified. In particular, our colleague would need to specify how to intervene 
on body weight to ensure that each individual experiences treatment value A = 
1. For example, he might consider a genetic modification to increase fat tissue 
in both waist and coronary arteries, or a regime of extreme physical inactivity 
with high caloric intake, or the replacement of the intestinal microbiota, or 
surgery, or a combination of these and other interventions. The problem is 
that each of these options may have different effects on mortality even if they 
all could somehow set adiposity at the same level. 

Take Zeus, who was obese at age 40 (A = 1) and had a fatal myocardial
infarction at age 49 (Y = 1). Zeus had genes that predisposed him to large
amounts of fat tissue in both his waist and his coronary arteries, so he died
despite exercising moderately, keeping a healthy diet, and having a favorable
intestinal microbiota. If, contrary to fact, his genes had been neutral but he
had become obese (A = 1) after a lifetime of lack of exercise, too many calories
in the diet, and an unfavorable intestinal microbiota, then he would not have
died by age 50 (Y = 0). Therefore, what is Zeus’s counterfactual outcome
$Y^{a=1}$ under “obesity” a = 1? We have just said that he died under one set
of circumstances that led to obesity A = 1, but would not have died under
another set of circumstances that would have also led to obesity A = 1. The
counterfactual outcome $Y^{a=1}$ under a = 1 is ill-defined.

The counterfactual outcome $Y^{a=0}$ if Zeus had been nonobese is also
ill-defined. If Zeus had not been obese, he might have either died or not died
by age 50, depending on how he managed to remain nonobese. For example,
suppose a nonobese Zeus would have died by age 50 if he had been nonobese
after a lifetime of exercise (cause of death: a bicycle accident), cigarette
smoking (cause of death: lung cancer), or bariatric surgery (cause of death: adverse
reaction to anesthesia), and would have survived if he had been nonobese after
a lifetime of a healthy diet (fewer calories from devouring his children), more
favorable genes (less visceral fat tissue), or a different microbiota (less fat
absorption). Because it is unclear which version of “no obesity” A = 0 we are
considering, the counterfactual outcome $Y^{a=0}$ under a = 0 is ill-defined.

Questions about the effect of obesity on job discrimination (as measured
by the proportion of job applicants called for a personal interview
after the employer reviews the applicant's resume and photograph)
are less vague. Because the treatment is “obesity as perceived by the
employer,” the mechanisms that led to obesity may be irrelevant.

The phrase “no causation without manipulation” (Holland 1986)
captures the idea that meaningful causal inference requires sufficiently
well-defined interventions (versions of treatment). However, bear in
mind that sufficiently well-defined interventions may not be humanly
feasible, or practicable, interventions at a particular time in history.
For example, the causal effect of genetic variants on human
disease was sufficiently well defined even before the existence of
technology for genetic modification.
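
The Zeus example can be restated as a small check: a counterfactual outcome under treatment value a is well defined only if every version of that treatment leads to the same outcome. The sketch below encodes the outcomes described in the text; the version labels are shorthand introduced here, and the function name is hypothetical.

```python
# Zeus's outcome (1 = died by age 50) under each version of treatment, as
# described in the text. The version labels are shorthand introduced here.
zeus_outcomes = {
    1: {  # versions of "obesity" (a = 1)
        "predisposing genes": 1,
        "inactivity, excess calories, unfavorable microbiota": 0,
    },
    0: {  # versions of "no obesity" (a = 0)
        "lifetime of exercise": 1,
        "cigarette smoking": 1,
        "bariatric surgery": 1,
        "healthy diet": 0,
        "favorable genes": 0,
        "different microbiota": 0,
    },
}


def counterfactual_outcome(a):
    """Return Y^a if all versions of treatment a agree on the outcome, else None."""
    values = set(zeus_outcomes[a].values())
    return values.pop() if len(values) == 1 else None


for a in (1, 0):
    print(a, counterfactual_outcome(a))
```

Both calls return `None`, which mirrors the conclusion that neither counterfactual outcome is well defined until the version of treatment is specified. Restricting the dictionary for a = 0 to a single version, say "healthy diet", would yield a well-defined outcome.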

Ill-defined counterfactual outcomes result in vague causal questions. If our 
colleague is interested in the effect of obesity A = 1 on mortality, he will have 
to work harder to define the counterfactual outcomes $Y^{a=0}$ and $Y^{a=1}$.
Another example: if interested in the causal effect of exercise, we might need
to define the duration, frequency, intensity, and type of exercise (swimming, 
running, playing basketball...), how the time devoted to exercise would other- 
wise be spent (playing with your children, rehearsing with your band, watching 
television...), etc. 

Note that absolute precision in the definition of the treatment is not needed 
for useful causal inference. For example, for the causal effect of exercise, scien- 
tists agree that the benefits of running clockwise around your neighborhood’s 
park are the same as those of running counterclockwise. Therefore, when de- 
scribing the treatment “lifetime exercise,” the direction of the running need 
not be specified. This and other aspects of the treatment are deemed to be 
irrelevant because varying them would not lead to different counterfactual out- 
comes. That is, we only need sufficiently well-defined interventions a for which 
no meaningful vagueness remains. 

This raises the question: how do we know that a treatment is sufficiently
well-defined or, equivalently, that no meaningful vagueness remains? The
answer is “We don’t.” Declaring a treatment sufficiently well-defined is a matter 
of agreement among experts based on the available substantive knowledge. 
Today we agree that the direction of running is irrelevant, but future research 
might prove us wrong if it is demonstrated that, say, leaning the body to 
the right, but not to the left, while running is harmful. At any point in 
history, experts who write the protocols of randomized experiments make an 
attempt to eliminate as much vagueness as possible by employing the subject- 
matter knowledge at their disposal. However, some vagueness is inherent to 
all causal questions. The vagueness of causal questions can be reduced by a 
more detailed specification of treatment, but cannot be completely eliminated. 
Yet the degree of vagueness is especially high in observational studies with 
causal questions involving biological (e.g., body weight, LDL-cholesterol) or 
social (e.g., socioeconomic status) “treatments.” 

The above discussion illustrates an intrinsic feature of causal inference: the 
articulation of causal questions is contingent on domain expertise and infor- 
mal judgment. What we view as a scientifically meaningful causal question at 
present may turn out to be viewed as too vague in the future after learning 
that finer components of the treatment affect the outcome and therefore the 
magnitude of the causal effect. Years from now, scientists will probably refine 
our obesity question in terms of cellular modifications which we barely under- 
stand at this time. Again, the term sufficiently well-defined treatment relies 
on expert consensus, which by definition changes over time. Fine Point 3.3
describes an alternative, but logically equivalent, way to make causal questions
more precise.

At this point, some readers may rightly note that the process of better spec- 
ifying the treatment may alter the original question. We started by declaring 
our colleague’s interest in the effect of obesity, but we ended up by discussing 
hypothetical interventions on exercise. The more we focus on providing a suffi- 
ciently well-defined causal interpretation to our analyses, the farther from the 
original question we seem to get. But that is a good thing. Refining the causal 
question, until it is agreed that no meaningful vagueness remains, is a funda- 
mental component of causal inference. Declaring our interest in “the effect of 
obesity” is just a starting point for a discussion with our colleagues. During 
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Fine Point 3.3 


Possible worlds. Some philosophers of science define causal contrasts using the concept of “possible worlds.” The 
actual world is the way things actually are. A possible world is a way things might be. Imagine a possible world a where 
everybody receives treatment value a, and a possible world a’ where everybody receives treatment value a’. The mean 
of the outcome is E[Y^a] in the first possible world and E[Y^{a’}] in the second one. These philosophers say that there is 
an average causal effect if E[Y^a] ≠ E[Y^{a’}] and the worlds a and a’ are the two worlds closest to the actual world where 
all individuals receive treatment value a and a’, respectively. 

We introduced an individual's counterfactual outcome Y^a as her outcome under a sufficiently well-defined intervention 
that assigned treatment value a to her. These philosophers prefer to think of the counterfactual Y^a as the 
outcome in the possible world that is closest to our world and where the individual was treated with a. Both definitions 
are equivalent when the only difference between the closest possible world and the actual world is that the intervention 
of interest took place. The possible worlds formulation of counterfactuals replaces the sometimes difficult problem of 
specifying the intervention of interest by the equally difficult problem of describing the closest possible world that is 
minimally different from the actual world. Stalnaker (1968) and Lewis (1973) proposed counterfactual theories based 
on possible worlds. 





that discussion, we will sharpen the causal question by refining the specification 
of the treatment until, hopefully, a consensus is reached. The more precisely 
we define the treatment, the fewer opportunities for miscommunication among 
scientists exist, especially when the numerical estimates of causal effect do not 
agree across studies. 

So far we have only reviewed the first component of consistency: the spec- 
ification of sufficiently well-defined treatments. But a relatively unambiguous 
interpretation of numerical estimates also requires the second component of 
consistency. 


3.5 Consistency: Second, link counterfactuals to the observed data 


Inspired by the arguments in the previous section, our colleague decided to 
transform his vague causal question about the effect of obesity on mortality by 
age 50 into a more precise causal question. He is now interested in the following 
intervention (a = 1): “at age 18 and through age 40, put every individual on 
a stringent mandatory diet that guarantees that they would never weigh more 
than their weight at the age of 18 years.” Specifically, each individual is weighed 
every day starting on the day before his eighteenth birthday. Whenever the 
weight is greater than the baseline weight at 18 years, the individual’s caloric 
intake is restricted, without changing his usual mix of calorie sources and 
micronutrients, until the time (usually within 1-3 days) that the individual 
falls below baseline weight. Thus, ignoring errors of a kilogram or two, no 
individual would ever weigh more than his baseline weight through age 40. No 
instructions or restrictions are given concerning exercise at any time or diet 
during non-calorie-restricted periods. The comparison intervention (a = 0) is 
“do not intervene.” 

This hypothetical intervention was 
described by Robins (2008). The 
hypothetical intervention was restricted 
to men in order to avoid 
the complicating issue of how much 
weight gain to allow during pregnancy. 

Suppose experts agree that these treatment values a = 1 and a = 0 are 
sufficiently well-defined and, therefore, that no meaningful vagueness remains 
in the specification of the counterfactual outcomes Y^{a=1} and Y^{a=0}. We can 
now shift our attention to the equal sign in the consistency condition Y^a = Y 
See Technical Point 3.2 for additional 
discussion on the vagueness 
of causal inference when the versions 
of treatment are unknown. 


Treatment-variation irrelevance 
was defined in Fine Point 1.2. 
Formally, this condition holds if, 
for any two versions A(r) and 
A'(r) of compound treatment 
R = r, Y_i^{r,a(r)} = Y_i^{r,a'(r)} for all i 
and r, where Y_i^{r,a(r)} is individual 
i's counterfactual outcome under 
version A(r) = a(r) of compound 
treatment R = r. 


For an expanded discussion of the 
issues described in Sections 3.4 and 
3.5, see the text and references in 
Hernán (2016), and in Robins and 
Weissman (2016). 




for individuals with A = a. 

To fix ideas, let us consider Ares, who maintained an approximately con- 
stant weight between the ages of 18 and 40 years despite not receiving our 
colleague’s stringent intervention a = 1. Rather, Ares maintained his baseline 
weight because of a mixture of good genes (from Hera) and vigorous physical 
activity (from frequent war combat). Thus Ares’s observed treatment value 
was not A = 1 and therefore his observed outcome Y does not necessarily 
equal the counterfactual outcome Y^{a=1} that he would have experienced if he 
had received our colleague’s hypothetical intervention a = 1. 

To preserve the link between the counterfactual outcomes Y^{a=1} and the observed 
outcomes Y, we have to ensure that only individuals receiving treatment 
version a = 1 are considered as treated individuals (A = 1) in the analysis, and 
similarly for the untreated. The implication is that, if we want to quantify the 
causal effect Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1] using observational data, we need 
data in which some individuals received treatment values consistent with a = 1 
and a = 0, that is, we need (unconditional) positivity. Being able to describe 
a well-defined intervention a, as our colleague did, is not helpful if the intervention 
cannot be linked to the observed data, that is, if we cannot reasonably 
assume that the equality Y^a = Y holds for at least some individuals. 

But restriction to the treatment value a of interest is impossible when, as 
it often happens, our data are not sufficiently rich. This problem would arise, 
for example, in an “obesity study” that collects data on body weight at age 40, 
but no data on the individual’s lifetime history of weight, exercise, and diet. 

One way out of this problem is to assume that the effects of all versions 
of treatment are identical, that is, to assume treatment-variation irrelevance. 
In some cases, this may be a good approximation. For example, if interested 
in the causal effect of high versus normal blood pressure on stroke, empirical 
evidence suggests that lowering blood pressure through different pharmaco- 
logical mechanisms results in similar outcomes. We might then argue that 
a precise definition of the treatment “blood pressure” is unnecessary to link 
the potential and observed outcomes. In other cases, however, the validity of 
the assumption is more questionable. For example, if interested in the aver- 
age causal effect of weight maintenance on death, empirical evidence suggests 
that some interventions would increase the risk (e.g., continuation of smoking), 
whereas others would decrease it (e.g., moderate exercise). In practice, many 
observational analyses implicitly assume treatment-variation irrelevance when 
making causal inferences about treatments with multiple versions. 

In summary, ill-defined treatments like “obesity” complicate the interpre- 
tation of causal effect estimates (previous section), but so do sufficiently well- 
defined treatments that are absent in the data (this section). Detecting a mis- 
match between the treatment values of interest and the data at hand requires 
a careful characterization of the versions of treatment that operate in the pop- 
ulation. Such characterization may be simple in experiments (i.e., whatever 
intervention investigators use to assign treatment) and relatively straightfor- 
ward in some observational analyses (e.g., those studying the effects of medical 
treatments), but difficult or impossible in many observational analyses that 
study the effects of biological and social factors. 

Of course, the characterization of the treatment versions present in the 
data would be unnecessary if experts explicitly agreed that all versions have a 
similar causal effect. However, because experts are fallible, the best we can do 
is to make these discussions and our assumptions as transparent as possible, so 
that others can directly challenge our arguments. The next section describes 
a procedure to achieve that transparency. 


3.6 The target trial 




The target trial, or its logical 
equivalents, is central to the 
causal inference framework. Dorn 
(1953), Cochran (1972), Rubin 
(1974), Feinstein (1971), and 
Dawid (2000) used it. Robins 
(1986) generalized the concept to 
time-varying treatments. 


Hernán and Robins (2016) reviewed 
the key components of the target 
trial that need to be specified— 
regardless of whether the causal 
inference is based on a random- 
ized experiment or an observational 
study—and emulation procedures 
when using observational data. 


This book's authors and their col- 
laborators have followed a similar 
procedure to estimate the effect of 
weight loss using observational data 
(see, for example, Danaei et al, 
2016). We tried to carefully define 
the timing of the treatment strate- 
gies under the assumption that the 
method used to lose weight was ir- 
relevant. 
In this section and throughout the book, the term causal effect refers to a 
contrast between average counterfactual outcomes under different treatment 
values. Therefore, for each causal effect, we can imagine a (hypothetical) ran- 
domized experiment to quantify it. We refer to that hypothetical experiment 
as the target experiment or the target trial. When conducting the target trial 
is not feasible, ethical, or timely, we resort to causal analyses of observational 
data. That is, causal inference from observational data can be viewed as an 
attempt to emulate the target trial. If the emulation is successful, there is no 
difference between the observational estimates and the numerical results that 
the target trial would have yielded (had it been conducted). As we said in 
Section 3.1, if the analogy between an observational study and a conditionally 
randomized experiment happens to be correct in our data, then we can use the 
methods described in the previous chapter—IP weighting or standardization— 
to compute causal effects from observational studies. (See Fine Point 3.4 for 
how to use observational data to compute the proportion of cases attributable 
to treatment.) 

Therefore “what randomized experiment are you trying to emulate?” is 
a key question for causal inference from observational data. For each causal 
effect that we wish to estimate using observational data, we can describe (i) 
the target trial that we would like to, but cannot, conduct, and (ii) how the 
observational data can be used to emulate that target trial. 

Describing the target trial can be done by specifying the key components 
of its protocol: eligibility criteria, interventions (or treatment strategies), out- 
come, follow-up, causal contrast, and statistical analysis. Here we focus on the 
treatment strategies or, in the language of this chapter, the interventions that 
will be compared across groups. As discussed in the previous two sections, 
investigators will first specify the interventions of interest and then identify 
individuals who receive them in the data. 

Consider the causal effect of “weight loss” on mortality in individuals who 
are obese and do not smoke at age 40. The first step for investigators is to 
make their causal question less vague. For example, they might agree that 
their goal is estimating the effect of losing 5% of body mass index every year, 
starting at age 40 and for as long as their body mass index stays over 25, under 
the assumption that it does not matter how the weight loss is achieved. They 
can now transfer this treatment strategy to the protocol of a target trial which 
they will attempt to emulate with the data at their disposal. 
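
The strategy in the previous paragraph is specific enough to be written down as a rule. The following is a minimal sketch, not part of any protocol cited in the text: the names `WeightLossStrategy` and `target_bmi_by_age` are hypothetical, and the sketch assumes, as a simplification, that the 5% loss is applied once per year and is achieved exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightLossStrategy:
    """Hypothetical encoding of the treatment strategy described in the text."""
    start_age: int = 40                  # the strategy starts at age 40
    yearly_loss_fraction: float = 0.05   # lose 5% of body mass index every year
    bmi_threshold: float = 25.0          # keep losing while BMI stays over 25


def target_bmi_by_age(strategy, bmi_at_start, end_age):
    """Return {age: target BMI} under the strategy.

    Assumption (not from the text): the loss is applied once per year,
    and each yearly target is met exactly.
    """
    targets = {strategy.start_age: bmi_at_start}
    bmi = bmi_at_start
    for age in range(strategy.start_age + 1, end_age + 1):
        if bmi > strategy.bmi_threshold:
            bmi = bmi * (1 - strategy.yearly_loss_fraction)
        targets[age] = bmi
    return targets


# Example: an individual with a body mass index of 32 at age 40.
targets = target_bmi_by_age(WeightLossStrategy(), 32.0, 46)
```

Writing the rule out this way exposes the choices that the verbal description leaves open, such as when during the year the loss must occur, which is the kind of vagueness the target trial protocol is meant to surface.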

An explicit emulation of the target trial prevents investigators from con- 
ducting an oversimplified analysis that compares the risk of death in, say, obese 
versus nonobese individuals at age 40. That comparison corresponds implic- 
itly to a target trial in which obese individuals are instantaneously transformed 
into individuals with a body mass index of 25 at baseline (through a massive 
liposuction?). Such a target trial cannot be emulated because very few people, 
if anyone, in the real world undergo such instantaneous change, and thus the 
counterfactual outcomes cannot be linked to the observed outcomes. 

The conceptualization of causal inference from observational data as an 
attempt to emulate a target trial is not universally accepted. Some authors 
presuppose that “the average causal effect of A on Y” is a well-defined quan- 
tity, no matter what A and Y stand for (as long as A temporally precedes 
Y). For example, when considering the effect of obesity, they claim that it 
is not necessary to carefully specify the target trial. In contrast to our view 
that specifying the target trial is necessary for interpreting numerical effect es- 
Fine Point 3.4 


Attributable fraction. We have described effect measures like the causal risk ratio Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1] and 
the causal risk difference Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1], which compare the counterfactual risk under treatment a = 1 
with the counterfactual risk under treatment a = 0. However, one could also be interested in measures that compare 
the observed risk with the counterfactual risk under either treatment a = 1 or a = 0. This latter contrast allows us 
to compute the proportion of cases that are attributable to treatment in an observational study, i.e., the proportion of 
cases that would not have occurred had treatment not occurred. For example, suppose that all 20 individuals in our 
population attended a dinner in which they were served either ambrosia (A = 1) or nectar (A = 0). The following day, 
7 of the 10 individuals who received A = 1, and 1 of the 10 individuals who received A = 0, were sick. For simplicity, 
assume exchangeability of the treated and the untreated so that the causal risk ratio is 0.7/0.1 = 7 and the causal 
risk difference is 0.7 - 0.1 = 0.6. (In conditionally randomized experiments, one would compute these effect measures 
via standardization or IP weighting.) It was later discovered that the ambrosia had been contaminated by a flock of 
doves, which explains the increased risk summarized by both the causal risk ratio and the causal risk difference. We 
now address the question ‘what fraction of the cases was attributable to consuming ambrosia?’ 

In this study we observed 8 cases, i.e., the observed risk was Pr[Y = 1] = 8/20 = 0.4. The risk that would 
have been observed if everybody had received a = 0 is Pr[Y^{a=0} = 1] = 0.1. The difference between these two risks 
is 0.4 - 0.1 = 0.3. That is, there is an excess 30% of the individuals who did fall ill but would not have fallen ill if 
everybody in the population had received a = 0 rather than their treatment A. Because 0.3/0.4 = 0.75, we say that 
75% of the cases are attributable to treatment a = 1: compared with the 8 observed cases, only 2 cases would have 
occurred if everybody had received a = 0. This excess fraction or attributable fraction is defined as 


(Pr[Y = 1] - Pr[Y^{a=0} = 1]) / Pr[Y = 1] 


See Fine Point 5.4 for a discussion of the excess fraction in the context of the sufficient-component-cause framework. 
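
The arithmetic of the ambrosia example can be retraced in a few lines. This is only an illustrative sketch: the function name `excess_fraction` is ours, and the counterfactual risk under a = 0 is taken to equal the observed risk in the untreated because the example assumes exchangeability.

```python
def excess_fraction(observed_risk, risk_if_all_untreated):
    """(Pr[Y = 1] - Pr[Y^{a=0} = 1]) / Pr[Y = 1]."""
    return (observed_risk - risk_if_all_untreated) / observed_risk


# Counts from the dinner example: 10 individuals ate ambrosia (A = 1), 10 had nectar (A = 0).
n_treated, cases_treated = 10, 7
n_untreated, cases_untreated = 10, 1

# Observed risk in the whole population: 8 cases among 20 individuals.
observed_risk = (cases_treated + cases_untreated) / (n_treated + n_untreated)

# Under exchangeability, the risk had everybody received a = 0 equals
# the observed risk among the untreated.
risk_a0 = cases_untreated / n_untreated

fraction = excess_fraction(observed_risk, risk_a0)
cases_if_all_untreated = risk_a0 * (n_treated + n_untreated)
```

Up to floating-point rounding, `fraction` is 0.75 and `cases_if_all_untreated` is 2, matching the numbers in the text.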

The excess fraction is generally different from the etiologic fraction, another version of the attributable fraction 
which is defined as the proportion of cases mechanically caused by exposure. For example, suppose the untreated 
(A = 0) would have had 7 cases if they had been treated, but these 7 cases would not have contained the 1 untreated 
case that actually occurred, i.e., treatment produces 7 cases but prevents 1 case. Also suppose that, if untreated, the 
treated would have had only 1 case but different from the 7 cases they actually had. Then the excess fraction would 
not be equal to the etiologic fraction. Here the excess fraction is a lower bound on the etiologic fraction. Because 
the etiologic fraction does not rely on the concept of excess cases, it can only be computed in randomized experiments 
under strong assumptions (Greenland and Robins, 1988). 


timates, these authors question the need for such quantitative interpretation. 

Their argument goes like this: 

    We may not precisely know which particular causal effect is 
    being estimated in an observational study, but is that really so 
    important if indeed some causal effect exists? A strong association 
    between obesity and mortality may imply that there exists some 
    intervention on body weight that reduces mortality. There is value 
    in learning that many deaths could have been prevented if all obese 
    people had been forced, somehow, to be of normal weight, even 
    if the intervention required for achieving that transformation is 
    unspecified. 

For some examples of this point of 
view, see Pearl (2009), Schwartz 
et al (2016), and Glymour and 
Spiegelman (2016). 


This is an appealing, but risky, argument. Accepting it raises an important 
problem: Ill-defined versions of treatment prevent a proper consideration of 
exchangeability and positivity in observational studies. 




Extreme interventions are more 
likely to go unrecognized when they 
are not explicitly specified. 


For an extended discussion about 
the differences between prediction 
and causal inference, which is a 
form of counterfactual prediction, 
see Hernán, Hsu, and Healy (2019). 
Let us talk about exchangeability first. To correctly emulate the target 
trial, investigators need to emulate randomization itself, which is tantamount 
to achieving exchangeability of the treated and the untreated, possibly condi- 
tional on covariates L. If we forgo characterizing the treatment version corre- 
sponding to our causal question about obesity, how can we even try to identify 
and measure the covariates L that make obese and nonobese individuals condi- 
tionally exchangeable, i.e., covariates L that are determinants of the versions of 
treatment (obesity) and also risk factors for the outcome (mortality)? When 
trying to estimate the effect of an unspecified treatment version, the usual 
uncertainty regarding conditional exchangeability is greatly exacerbated. 

The acceptance of unspecified versions of treatment also affects positivity. 
Suppose we decide to compute the effect of obesity on mortality by adjusting 
for covariates L that include diet and exercise. It is possible that, for some 
values of these variables, no individual will be obese; that is, positivity does 
not hold. If enough biologic knowledge is available, one could preserve pos- 
itivity by restricting the analysis to the strata of L in which the population 
contains both obese and nonobese individuals, but these strata may no 
longer be representative of the original population. 
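
The restriction described above can be sketched in code. The example below is hypothetical throughout: the stratum labels, the records, and the function names are ours, and an empirical check of this kind only detects strata with no observed obese (or no observed nonobese) individuals in the data at hand; it cannot by itself tell whether that absence is structural.

```python
from collections import defaultdict


def strata_without_positivity(records):
    """Return the strata of L in which only one treatment level is observed.

    `records` is an iterable of (stratum, treatment) pairs, treatment in {0, 1}.
    """
    treatments_seen = defaultdict(set)
    for stratum, treatment in records:
        treatments_seen[stratum].add(treatment)
    return sorted(s for s, seen in treatments_seen.items() if seen != {0, 1})


def restrict_to_positive_strata(records):
    """Keep only records from strata that contain both treatment levels."""
    excluded = set(strata_without_positivity(records))
    return [(s, a) for s, a in records if s not in excluded]


# Hypothetical data: strata combine diet and exercise; treatment 1 = obese.
records = [
    ("restricted diet, daily exercise", 0),
    ("restricted diet, daily exercise", 0),
    ("unrestricted diet, no exercise", 0),
    ("unrestricted diet, no exercise", 1),
    ("unrestricted diet, no exercise", 1),
]

violating = strata_without_positivity(records)
restricted = restrict_to_positive_strata(records)
```

In this toy data set the first stratum contains no obese individuals, so it is dropped, and any effect estimated from `restricted` applies only to the remaining stratum rather than to the original population.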

Positivity violations point to another potential problem: unspecified ver- 
sions of treatment may correspond to a target trial that implements unreason- 
able interventions. The apparently straightforward comparison of obese and 
nonobese individuals in observational studies masks the true complexity of in- 
terventions such as ‘make everybody in the population instantly nonobese.’ 
Had these interventions been made explicit, investigators would have realized 
that these drastic changes are unlikely to be observed in the real world, and 
therefore they are irrelevant for anyone considering weight loss. As discussed 
above, a more reasonable, even if not completely well-defined, intervention may 
be to reduce body mass index by 5% annually. Anchoring causal inferences to 
a target trial not only helps sharpen the specification of the causal question in 
observational analyses, but also makes the inferences more relevant for decision 
making. 

The problems generated by unspecified treatments cannot be dealt with 
by applying sophisticated statistical methods. All analytic methods for causal 
inference from observational data described in this book yield effect estimates 
that are only as well defined as the treatments that are being compared. Al- 
though the exchangeability condition can be replaced by other unverifiable 
conditions (see Chapter 16) and the positivity condition can be waived if one 
is willing to make untestable extrapolations via modeling (Chapter 14), the 
requirement of sufficiently well-defined treatments is so fundamental that it 
cannot be waived without simultaneously negating the possibility of describ- 
ing the causal effect that is being estimated. 

Is everything lost when the observational data cannot be used to emulate 
an interesting target trial? Not really. Observational data may still be quite 
useful by focusing on non-causal prediction, for which the concept of target 
trial does not apply. That obese individuals have a higher mortality risk than 
nonobese individuals means that obesity is a predictor of—is associated with— 
mortality. This is an important piece of information to identify individuals at 
high risk of mortality. Note, however, that by simply saying that obesity 
predicts—is associated with—mortality, we remain agnostic about the causal 
effects of obesity on mortality: obesity might predict mortality in the sense 
that carrying a lighter predicts lung cancer. Thus the association between 
obesity and mortality is an interesting hypothesis-generating exercise and a 
motivation for further research (why does obesity predict mortality anyway?), 
Technical Point 3.2 


Cheating consistency. Consider a compound treatment R with multiple, relevant versions of treatment. Interestingly, 
even if the versions of treatment are not well defined, we may still articulate a consistency condition that is guaranteed 
to hold (Hernán and VanderWeele, 2011; VanderWeele and Hernán, 2013): For individuals with R_i = r we let A_i(r) 
denote the version of treatment R_i = r actually received by individual i; for individuals with R_i ≠ r we define A_i(r) = 0 
so that A_i(r) ∈ {0} ∪ A(r). The consistency condition then requires for all i, 


Y_i = Y_i^{r,a(r)} when R_i = r and A_i(r) = a(r). 


That is, the outcome for every individual who received a particular version of treatment R = r equals his outcome 
if he had received that particular version of treatment. This statement is true by definition of version of treatment if 
we in fact define the counterfactual Y_i^{r,a(r)} for individual i with R_i = r and A_i(r) = a(r) as individual i's outcome 
that he actually had under actual treatment r and actual version a(r). However, using this consistency condition is 
self-defeating because, as discussed in the main text, it prevents us from understanding what effect is being estimated 
and from being able to evaluate exchangeability and positivity. 

Similarly, consider the following hypothetical intervention: ‘assign everybody to being nonobese by changing the 
determinants of body weight to reflect the distribution of those determinants in those who are nonobese in the study 
population.’ This intervention would randomly assign a version of treatment to each individual in the study population 
so that the resulting distribution of versions of treatment exactly matches the distribution of versions of treatment in 
the study population. Analogously, we can propose another hypothetical, random intervention that assigns everybody 
to being obese. 

This trick is implicitly used in the analysis of many observational studies that compare the risks Pr[Y = 1|A = 1] 
and Pr[Y = 1|A = 0] (often conditional on other variables) to endow the contrast with a causal interpretation. A 
problem with this trick is, of course, that the proposed random interventions may not match any realistic interventions 
we are interested in. Learning that intervening on ‘the determinants of body weight to reflect the distribution of 
those determinants in those with nonobese weight’ decreases mortality by, say, 30% does not imply that realistic 
interventions (e.g., modifying caloric intake or exercise levels) will decrease mortality by 30% too. In fact, if intervening 
on ‘determinants of body weight in the population’ requires intervening on genetic factors, then a 30% reduction in 
mortality may be unattainable by interventions that can actually be implemented in the real world. 


but not necessarily an appropriate justification to recommend a weight loss 
intervention targeted to the entire population. 

By retreating into prediction from observational data, we avoid tackling 
questions that cannot be logically asked in randomized experiments, not even 
in principle. On the other hand, when causal inference is the ultimate goal, 
prediction may be unsatisfying. 


Chapter 4 
EFFECT MODIFICATION 


So far we have focused on the average causal effect in an entire population of interest. However, many causal 
questions are about subsets of the population. Consider again the causal question “does one’s looking up at 
the sky make other pedestrians look up too?” You might be interested in computing the average causal effect of 
treatment (your looking up at the sky) in city dwellers and visitors separately, rather than the average effect in 
the entire population of pedestrians. 

The decision whether to compute average effects in the entire population or in a subset depends on the 
inferential goals. In some cases, you may not care about the variations of the effect across different groups of 
individuals. For example, suppose you are a policy maker considering the possibility of implementing a nationwide 
water fluoridation program. Because this public health intervention will reach all households in the population, 
your primary interest is in the average causal effect in the entire population, rather than in particular subsets. 
You will be interested in characterizing how the causal effect varies across subsets of the population when the 
intervention can be targeted to different subsets, or when the findings of the study need to be applied to other 
populations. 

This chapter emphasizes that there is no such thing as the causal effect of treatment. Rather, the causal 
effect depends on the characteristics of the particular population under study. 


4.1 Definition of effect modification 





Table 4.1 
              V   Y^{a=0}   Y^{a=1} 
Rheia         1      0         1 
Demeter       1      0         0 
Hestia        1      0         0 
Hera          1      0         0 
Artemis       1      1         1 
Leto          1      0         1 
Athena        1      1         1 
Aphrodite     1      0         1 
Persephone    1      1         1 
Hebe          1      1         0 
Kronos        0      1         0 
Hades         0      0         0 
Poseidon      0      1         0 
Zeus          0      0         1 
Apollo        0      1         0 
Ares          0      1         1 
Hephaestus    0      0         1 
Cyclope       0      0         1 
Hermes        0      1         0 
Dionysus      0      1         0 

We started this book by computing the average causal effect of heart transplant 
A on death Y in a population of 20 members of Zeus’s extended family. 
We used the data in Table 1.1, whose columns show the individual values 
of the (generally unobserved) counterfactual outcomes Y^{a=0} and Y^{a=1}. After 
examining the data in Table 1.1, we concluded that the average causal 
effect was null. Half of the members of the population would have died if 
everybody had received a heart transplant, Pr[Y^{a=1} = 1] = 10/20 = 0.5, 
and half of the members of the population would have died if nobody had received 
a heart transplant, Pr[Y^{a=0} = 1] = 10/20 = 0.5. The causal risk ratio 
Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1] was 0.5/0.5 = 1 and the causal risk difference 
Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1] was 0.5 - 0.5 = 0. 

We now consider two new causal questions: What is the average causal 
effect of A on Y in women? And in men? To answer these questions we 
will use Table 4.1, which contains the same information as Table 1.1 plus an 
additional column with an indicator V for sex: V = 1 for females (referred 
to as women in this book) and V = 0 for males (referred to as men). For 
convenience, we have rearranged the table so that women occupy the first 10 
rows, and men the last 10 rows. 

Let us first compute the average causal effect in women. To do so, we need 
to restrict the analysis to the first 10 rows of the table with V = 1. In this 
subset of the population, the risk of death under treatment is Pr[Y^{a=1} = 1|V = 
1] = 6/10 = 0.6 and the risk of death under no treatment is Pr[Y^{a=0} = 1|V = 
1] = 4/10 = 0.4. The causal risk ratio is 0.6/0.4 = 1.5 and the causal risk 
difference is 0.6 - 0.4 = 0.2. That is, on average, heart transplant A increases 
See Section 6.5 for a structural classification 
of effect modifiers. 


Additive effect modification: 
E[Y^{a=1} - Y^{a=0}|V = 1] ≠ 
E[Y^{a=1} - Y^{a=0}|V = 0] 


Multiplicative effect modification: 
E[Y^{a=1}|V = 1] / E[Y^{a=0}|V = 1] ≠ 
E[Y^{a=1}|V = 0] / E[Y^{a=0}|V = 0] 








We do not consider effect modifica- 
tion on the odds ratio scale because 
the odds ratio is rarely, if ever, the 
parameter of interest for causal in- 
ference. 


Multiplicative, but not additive, ef- 
fect modification by V: 


Pr[Y^{a=0} = 1|V = 1] = 0.8 
Pr[Y^{a=1} = 1|V = 1] = 0.9 
Pr[Y^{a=0} = 1|V = 0] = 0.1 
Pr[Y^{a=1} = 1|V = 0] = 0.2 




the risk of death Y in women. 


Let us next compute the average causal effect in men. To do so, we need to 
restrict the analysis to the last 10 rows of the table with V = 0. In this subset 
of the population, the risk of death under treatment is Pr[Y^{a=1} = 1|V = 0] = 
4/10 = 0.4 and the risk of death under no treatment is Pr[Y^{a=0} = 1|V = 0] = 
6/10 = 0.6. The causal risk ratio is 0.4/0.6 = 2/3 and the causal risk difference 
is 0.4 - 0.6 = -0.2. That is, on average, heart transplant A decreases the risk 
of death Y in men. 
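
The stratum-specific calculations above can be reproduced mechanically. The sketch below is an illustration rather than a method from the book: the rows are our transcription of Table 4.1, the function name `causal_effects` is ours, and exact fractions are used so that the results can be compared directly with the ratios in the text.

```python
from fractions import Fraction

# Rows of Table 4.1 as (name, V, Y under a=0, Y under a=1); V = 1 for women.
# Transcribed by hand; the counts agree with the risks stated in the text.
TABLE_4_1 = [
    ("Rheia", 1, 0, 1), ("Demeter", 1, 0, 0), ("Hestia", 1, 0, 0),
    ("Hera", 1, 0, 0), ("Artemis", 1, 1, 1), ("Leto", 1, 0, 1),
    ("Athena", 1, 1, 1), ("Aphrodite", 1, 0, 1), ("Persephone", 1, 1, 1),
    ("Hebe", 1, 1, 0), ("Kronos", 0, 1, 0), ("Hades", 0, 0, 0),
    ("Poseidon", 0, 1, 0), ("Zeus", 0, 0, 1), ("Apollo", 0, 1, 0),
    ("Ares", 0, 1, 1), ("Hephaestus", 0, 0, 1), ("Cyclope", 0, 0, 1),
    ("Hermes", 0, 1, 0), ("Dionysus", 0, 1, 0),
]


def causal_effects(rows):
    """Causal risk ratio and risk difference from the counterfactual outcomes."""
    n = len(rows)
    risk_untreated = Fraction(sum(y0 for _, _, y0, _ in rows), n)
    risk_treated = Fraction(sum(y1 for _, _, _, y1 in rows), n)
    return {
        "risk_ratio": risk_treated / risk_untreated,
        "risk_difference": risk_treated - risk_untreated,
    }


women = [row for row in TABLE_4_1 if row[1] == 1]
men = [row for row in TABLE_4_1 if row[1] == 0]

effects_all = causal_effects(TABLE_4_1)
effects_women = causal_effects(women)
effects_men = causal_effects(men)
```

With these rows, the risk ratio and risk difference are 1 and 0 in the whole population, 3/2 and 1/5 in women, and 2/3 and -1/5 in men. Such a calculation is only possible here because the table lists both counterfactual outcomes for every individual, which real data never do.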


Our example shows that a null average causal effect in the population does 
not imply a null average causal effect in a particular subset of the population. 
In Table 4.1, the null hypothesis of no average causal effect is true for the 
entire population, but not for men or women when taken separately. It just 
happens that the average causal effects in men and in women are of equal 
magnitude but in opposite direction. Because the proportion of each sex is 
50%, both effects cancel out exactly when considering the entire population. 
Although exact cancellation of effects is probably rare, heterogeneity of the 
individual causal effects of treatment is often expected because of variations in 
individual susceptibilities to treatment. An exception occurs when the sharp 
null hypothesis of no causal effect is true. Then no heterogeneity of effects 
exists because the effect is null for every individual and thus the average causal 
effect in any subset of the population is also null. 
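
The exact cancellation can be written out as a short worked equation. This is our own restatement, using only the law of total probability and the numbers of Table 4.1: the population risk difference is the average of the stratum-specific risk differences, weighted by the proportion of individuals in each stratum.

```latex
\begin{align*}
\Pr[Y^{a=1}=1] - \Pr[Y^{a=0}=1]
  &= \sum_{v \in \{0,1\}}
     \bigl( \Pr[Y^{a=1}=1 \mid V=v] - \Pr[Y^{a=0}=1 \mid V=v] \bigr) \Pr[V=v] \\
  &= (0.6 - 0.4)(0.5) + (0.4 - 0.6)(0.5) \\
  &= 0.1 - 0.1 = 0 .
\end{align*}
```

With any other sex distribution, for example 60% women, the same stratum-specific effects would not cancel: the population risk difference would be (0.2)(0.6) + (-0.2)(0.4) = 0.04.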


We are now ready to provide a definition of effect modifier. We say that V 
is a modifier of the effect of A on Y when the average causal effect of A on Y 
varies across levels of V. Since the average causal effect can be measured using 
different effect measures (e.g., risk difference, risk ratio), the presence of effect 
modification depends on the effect measure being used. For example, sex V 
is an effect modifier of the effect of heart transplant A on mortality Y on the 
additive scale because the causal risk difference varies across levels of V. Sex 
V is also an effect modifier of the effect of heart transplant A on mortality Y 
on the multiplicative scale because the causal risk ratio varies across levels of 
V. We only consider variables V that are not affected by treatment A as effect 
modifiers. 


In Table 4.1 the causal risk ratio is greater than 1 in women (V = 1) and 
less thanl in men (V = 0). Similarly, the causal risk difference is greater 
than 0 in women (V = 1) and less than0 in men (V = 0). That is, there is 
qualitative effect modification because the average causal effects in the subsets 
V = 1 and V = 0 are in the opposite direction. In the presence of qualitative 
effect modification, additive effect modification implies multiplicative effect 
modification, and vice versa. In the absence of qualitative effect modification, 
however, one can find effect modification on one scale (e.g., multiplicative) but 
not on the other (e.g., additive). To illustrate this point, suppose that, in a 
second study, we computed the counterfactual risks Pr[Y^{a=1} = 1|V = 1] = 0.9, 
Pr[Y^{a=0} = 1|V = 1] = 0.8, Pr[Y^{a=1} = 1|V = 0] = 0.2, and 
Pr[Y^{a=0} = 1|V = 0] = 0.1. In this study, there is no additive effect 
modification by V because the causal risk difference among individuals with 
V = 1 equals that among individuals with V = 0, i.e., 0.9 − 0.8 = 0.1 = 0.2 − 0.1. 
However, in this study there is multiplicative effect modification by V because 
the causal risk ratio among individuals with V = 1 differs from that among 
individuals with V = 0, that is, 0.9/0.8 ≈ 1.1 ≠ 0.2/0.1 = 2. Since one cannot generally state that there is, 
or there is not, effect modification without referring to the effect measure being 
used (e.g., risk difference, risk ratio), some authors use the term effect-measure 
modification, rather than effect modification, to emphasize the dependence of 
the concept on the choice of effect measure. 
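
The dependence on the effect measure can be checked numerically. The following is a minimal sketch, not part of the original text, that uses the four counterfactual risks of the second study (0.9, 0.8, 0.2, 0.1); the variable and function names are illustrative assumptions.

```python
# Counterfactual risks from the second study in the text:
# risks[v] = (risk under a=1, risk under a=0) within stratum V = v.
risks = {1: (0.9, 0.8), 0: (0.2, 0.1)}


def risk_difference(r1, r0):
    """Causal risk difference (additive scale)."""
    return r1 - r0


def risk_ratio(r1, r0):
    """Causal risk ratio (multiplicative scale)."""
    return r1 / r0


rd = {v: risk_difference(*r) for v, r in risks.items()}
rr = {v: risk_ratio(*r) for v, r in risks.items()}

TOL = 1e-9  # tolerance for floating-point comparisons
additive_modification = abs(rd[1] - rd[0]) > TOL
multiplicative_modification = abs(rr[1] - rr[0]) > TOL

print("risk differences:", rd)
print("risk ratios:", rr)
print("additive effect modification:", additive_modification)
print("multiplicative effect modification:", multiplicative_modification)
```

The comparison uses a small tolerance because 0.9 − 0.8 and 0.2 − 0.1 are not represented identically in floating point, even though both equal 0.1 mathematically.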

4.2 Stratification to identify effect modification 

Stratification: the causal effect of A on Y is computed in each stratum of V. For dichotomous V, the stratified causal risk differences are 
Pr[Y^{a=1} = 1|V = 1] − Pr[Y^{a=0} = 1|V = 1] 
and 
Pr[Y^{a=1} = 1|V = 0] − Pr[Y^{a=0} = 1|V = 0]. 


Table 4.2 
Stratum V = 0 

              L  A  Y 
Cybele        0  0  0 
Saturn        0  0  1 
Ceres         0  0  0 
Pluto         0  0  0 
Vesta         0  1  0 
Neptune       0  1  0 
Juno          0  1  1 
Jupiter       0  1  1 
Diana         1  0  0 
Phoebus       1  0  1 
Latona        1  0  0 
Mars          1  1  1 
Minerva       1  1  1 
Vulcan        1  1  1 
Venus         1  1  1 
Seneca        1  1  1 
Proserpina    1  1  1 
Mercury       1  1  0 
Juventas      1  1  0 
Bacchus       1  1  0 


A stratified analysis is the natural way to identify effect modification. To 
determine whether V modifies the causal effect of A on Y, one computes the 
causal effect of A on Y in each level (stratum) of the variable V. In the 
previous section, we used the data in Table 4.1 to compute the causal effect 
of transplant A on death Y in each of the two strata of sex V. Because 
the causal effect differed between the two strata (on both the additive and the 
multiplicative scale), we concluded that there was (additive and multiplicative) 
effect modification by V of the causal effect of A on Y. 

But the data in Table 4.1 are not the typical data one encounters in real 
life. Instead of the two columns with each individual’s counterfactual outcomes 
Y^{a=1} and Y^{a=0}, one will find two columns with each individual’s treatment 
level A and observed outcome Y. How does the unavailability of the counter- 
factual outcomes affect the use of stratification to detect effect modification? 
The answer depends on the study design. 

Consider first an ideal marginally randomized experiment. In Chapter 2 
we demonstrated that, leaving aside random variability, the average causal ef- 
fect of treatment can be computed using the observed data. For example, the 
causal risk difference Pr[Y^{a=1} = 1] − Pr[Y^{a=0} = 1] is equal to the observed 
associational risk difference Pr[Y = 1|A = 1] − Pr[Y = 1|A = 0]. The same 
reasoning can be extended to each stratum of the variable V because, if treatment 
assignment was random and unconditional, exchangeability is expected 
in every subset of the population. Thus the causal risk difference in women, 
Pr[Y^{a=1} = 1|V = 1] − Pr[Y^{a=0} = 1|V = 1], is equal to the associational risk 
difference in women, Pr[Y = 1|A = 1, V = 1] − Pr[Y = 1|A = 0, V = 1]. And 
similarly for men. Thus, to identify effect modification by V in an ideal exper- 
iment with unconditional randomization, one just needs to conduct a stratified 
analysis, that is, to compute the association measure in each level of the vari- 
able V. Stratification can be used to compute average causal effects in subsets 
of the population, but not individual effects (see Fine Points 2.1 and 3.2). 

Consider now an ideal randomized experiment with conditional randomiza- 
tion. In a population of 40 people, transplant A has been randomly assigned 
with probability 0.75 to those in severe condition (L = 1), and with probabil- 
ity 0.50 to the others (L = 0). The 40 individuals can be classified into two 
nationalities according to their passports: 20 are Greek (V = 1) and 20 are 
Roman (V = 0). The data on L, A, and death Y for the 20 Greeks are shown 
in Table 2.2 (same as Table 3.1). The data for the 20 Romans are shown in 
Table 4.2. The population risk under treatment, Pr[Y^{a=1} = 1], is 0.55, and 
the population risk under no treatment, Pr[Y^{a=0} = 1], is 0.40. (Both risks 
are readily calculated by using either standardization or IP weighting. We 
leave the details to the reader.) The average causal effect of transplant A 
on death Y is therefore 0.55 − 0.40 = 0.15 on the risk difference scale, and 
0.55/0.40 = 1.375 on the risk ratio scale. In this population, heart transplant 
increases the mortality risk. 

As discussed in the previous chapter, the calculation of the causal effect 
would have been the same if the data had arisen from an observational study 
in which we believe that conditional exchangeability Y^a ⊥⊥ A|L holds. 

We now discuss how to conduct a stratified analysis to investigate whether 
nationality V modifies the effect of A on Y. The goal is to compute the causal 
effect of A on Y in the Greeks, Pr[Y^{a=1} = 1|V = 1] − Pr[Y^{a=0} = 1|V = 1], and 
in the Romans, Pr[Y^{a=1} = 1|V = 0] − Pr[Y^{a=0} = 1|V = 0]. If these two causal 
risk differences differ, we will say that there is additive effect modification by 
V. And similarly for the causal risk ratios if interested in multiplicative effect 
modification. 

Fine Point 4.1 

Effect in the treated. This chapter is concerned with average causal effects in subsets of the population. One particular 
subset is the treated (A = 1). The average causal effect in the treated is not null if Pr[Y^{a=1} = 1|A = 1] ≠ Pr[Y^{a=0} = 1|A = 1] 
or, by consistency, if 

Pr[Y = 1|A = 1] ≠ Pr[Y^{a=0} = 1|A = 1]. 

That is, there is a causal effect in the treated if the observed risk among the treated individuals does not equal the 
counterfactual risk had the treated individuals been untreated. The causal risk difference in the treated is 
Pr[Y = 1|A = 1] − Pr[Y^{a=0} = 1|A = 1]. The causal risk ratio in the treated, also known as the standardized morbidity ratio (SMR), 
is Pr[Y = 1|A = 1] / Pr[Y^{a=0} = 1|A = 1]. The causal risk difference and risk ratio in the untreated are analogously 
defined by replacing A = 1 by A = 0. Figure 4.1 shows the groups that are compared when computing the effect in the 
treated and the effect in the untreated. 

The average effect in the treated will differ from the average effect in the population if the distribution of individual 
causal effects varies between the treated and the untreated. That is, when computing the effect in the treated, treatment 
group A = 1 is used as a marker for the factors that are truly responsible for the modification of the effect between 
the treated and the untreated groups. However, even though one could say that there is effect modification by the 
pretreatment variable V even if V is only a surrogate (e.g., nationality) for the causal effect modifiers, one would not 
say that there is modification of the effect of A by treatment A because it sounds confusing. 

See Section 6.6 for a graphical representation of true and surrogate effect modifiers. The bulk of this book is 
focused on the causal effect in the population because the causal effect in the treated, or in the untreated, cannot be 
directly generalized to time-varying treatments (see Part III). 

Step 2 can be ignored when V is equal to the variables L that are needed for conditional exchangeability (see Section 4.4). 

The procedure to compute the conditional risks Pr[Y^{a=1} = 1|V = v] and 
Pr[Y^{a=0} = 1|V = v] in each stratum v has two stages: 1) stratification by 
V, and 2) standardization by L (or, equivalently, IP weighting with weights 
depending on L). We computed the standardized risks in the Greek stratum 
(V = 1) in Chapter 2: the causal risk difference was 0 and the causal risk 
ratio was 1. Using the same procedure in the Roman stratum (V = 0), we can 
compute the risks Pr[Y^{a=1} = 1|V = 0] = 0.6 and Pr[Y^{a=0} = 1|V = 0] = 0.3. 
(Again, we leave the details to the reader.) Therefore, the causal risk difference 
is 0.3 and the causal risk ratio is 2 in the stratum V = 0. Because these effect 
measures differ from those in the stratum V = 1, we say that there is both 
additive and multiplicative effect modification by nationality V of the effect of 
transplant A on death Y. This effect modification is not qualitative because 
the effect is harmful or null in both strata V = 0 and V = 1. 
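
The two-stage procedure for the Roman stratum can be sketched in code. This is an illustrative sketch, not the authors' implementation: the rows are our reading of Table 4.2 as (L, A, Y) triples, and the function names are assumptions. Because the data are already restricted to V = 0, stage 1 (stratification by V) is done; the functions carry out stage 2 by standardization and by IP weighting.

```python
# (L, A, Y) for the 20 Romans (stratum V = 0), as we read Table 4.2.
romans = [
    (0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 0),
    (0, 1, 0), (0, 1, 0), (0, 1, 1), (0, 1, 1),
    (1, 0, 0), (1, 0, 1), (1, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1), (1, 1, 1),
    (1, 1, 0), (1, 1, 0), (1, 1, 0),
]


def standardized_risk(rows, a):
    """Sum over l of Pr[Y=1 | A=a, L=l] * Pr[L=l] within the given rows.

    Assumes positivity: every level of L contains someone with A = a.
    """
    n = len(rows)
    total = 0.0
    for l in {row[0] for row in rows}:
        stratum = [r for r in rows if r[0] == l]
        group = [r for r in stratum if r[1] == a]
        risk = sum(y for _, _, y in group) / len(group)
        total += risk * len(stratum) / n
    return total


def ip_weighted_risk(rows, a):
    """Mean over all individuals of I(A=a) * Y / Pr[A=a | L]."""
    n = len(rows)
    total = 0.0
    for l, a_i, y in rows:
        if a_i != a:
            continue
        stratum = [r for r in rows if r[0] == l]
        p_a_given_l = sum(1 for r in stratum if r[1] == a) / len(stratum)
        total += y / p_a_given_l
    return total / n


for a in (1, 0):
    print(a, standardized_risk(romans, a), ip_weighted_risk(romans, a))
```

Both methods should reproduce, up to floating-point rounding, the risks of 0.6 (under treatment) and 0.3 (under no treatment) stated above for the stratum V = 0.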

We have shown that, in our study population, nationality V modifies the 
effect of heart transplant A on the risk of death Y. However, we have made no 
claims about the causal mechanisms involved in such effect modification. In 
fact, it is possible that nationality is simply a marker for the causal factor that 
is truly responsible for the modification of the effect. For example, suppose 
that the quality of heart surgery is better in Greece than in Rome. One would 
then find effect modification by nationality. An intervention to improve the 
quality of heart surgery in Rome could eliminate the modification of the causal 
effect by passport-defined nationality. Whenever we want to emphasize this 
distinction, we will refer to nationality as a surrogate effect modifier, and to 
quality of care as a causal effect modifier. 

See Section 6.6 for a graphical representation of surrogate and causal effect modifiers. 

Therefore, our use of the term effect modification by V does not necessarily 
imply that V plays a causal role in the modification of the effect. To avoid 
potential confusions, some authors prefer to use the more neutral term “effect 
heterogeneity across strata of V” rather than “effect modification by V.” The 
next chapter introduces “interaction,” a concept related to effect modification, 
that does attribute a causal role to the variables involved. 

Figure 4.1. Population of interest, divided into the treated and the untreated. Effect in the treated: E[Y^{a=1}|A = 1] vs. E[Y^{a=0}|A = 1]. Effect in the untreated: E[Y^{a=1}|A = 0] vs. E[Y^{a=0}|A = 0]. 


4.3 Why care about effect modification 


There are several related reasons why investigators are interested in identifying 
effect modification, and why it is important to collect data on pre-treatment 
descriptors V even in randomized experiments. 

First, if a factor V modifies the effect of treatment A on the outcome Y 
then the average causal effect will differ between populations with different 
prevalence of V. For example, the average causal effect in the population of 
Table 4.1 is harmful in women and beneficial in men, that is, there is qualita- 
tive effect modification. Because there are 50% of individuals of each sex and 
the sex-specific harmful and beneficial effects are equal but of opposite sign, 
the average causal effect in the entire population is null. However, had we 
conducted our study in a population with a greater proportion of women (e.g., 
graduating college students), the average causal effect in the entire population 
would have been harmful. In the presence of non-qualitative effect modifica- 
tion, the magnitude, but not the direction, of the average causal effect may 
vary across populations. As examples of non-qualitative effect modification, 
consider the effects of asbestos exposure (which differ between smokers and 
nonsmokers) and of universal health care (which differ between low-income 
and high-income families). 

That is, the average causal effect in a population depends on the distribution 
of individual causal effects in the population. There is generally no such 
thing as “the average causal effect of treatment A on outcome Y (period)”, 
but “the average causal effect of treatment A on outcome Y in a population 
with a particular mix of causal effect modifiers.” 

Technical Point 4.1 

Computing the effect in the treated. We computed the average causal effect in the population under conditional 
exchangeability Y^a ⊥⊥ A|L for both a = 0 and a = 1. Computing the average causal effect in the treated only requires 
partial exchangeability Y^{a=0} ⊥⊥ A|L. In other words, it is irrelevant whether the risk in the untreated, had they been 
treated, equals the risk in those who were actually treated. The average causal effect in the untreated is computed 
under the partial exchangeability condition Y^{a=1} ⊥⊥ A|L. 

We now describe how to compute the counterfactual mean E[Y^a|A = a'] via standardization, and via IP weighting, 
under the above assumptions of partial exchangeability: 

• Standardization: E[Y^a|A = a'] is equal to \sum_l E[Y|A = a, L = l] Pr[L = l|A = a']. See Miettinen (1972) and 
Greenland and Rothman (2008) for a discussion of standardized risk ratios. 

• IP weighting: E[Y^a|A = a'] is equal to the IP weighted mean 

\frac{E\left[ \dfrac{I(A = a)\, Y}{f(A|L)} \Pr[A = a'|L] \right]}{E\left[ \dfrac{I(A = a)}{f(A|L)} \Pr[A = a'|L] \right]} 

with weights Pr[A = a'|L] / f(A|L). For dichotomous A, this equality was derived by Sato and Matsuyama (2003). See Hernán and 
Robins (2006) for further details. 
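
As a hedged illustration of the two formulas in Technical Point 4.1, the sketch below computes Pr[Y^{a=0} = 1|A = 1] among the Romans by standardization and by IP weighting. The cell counts are our reading of Table 4.2, the function names are assumptions introduced here, and f(A|L) is estimated nonparametrically as the observed proportion with each treatment value within levels of L.

```python
# (L, A, Y, number of individuals) for the 20 Romans, as we read Table 4.2.
cells = [
    (0, 0, 1, 1), (0, 0, 0, 3),
    (0, 1, 1, 2), (0, 1, 0, 2),
    (1, 0, 1, 1), (1, 0, 0, 2),
    (1, 1, 1, 6), (1, 1, 0, 3),
]
rows = [(l, a, y) for (l, a, y, n) in cells for _ in range(n)]


def mean(values):
    """Arithmetic mean; fails on an empty group (a positivity violation)."""
    values = list(values)
    return sum(values) / len(values)


def standardized_mean(rows, a, a_prime):
    """sum_l E[Y | A=a, L=l] * Pr[L=l | A=a_prime]."""
    target = [r for r in rows if r[1] == a_prime]
    total = 0.0
    for l in {r[0] for r in target}:
        e_y = mean(y for (li, ai, y) in rows if li == l and ai == a)
        p_l = sum(1 for r in target if r[0] == l) / len(target)
        total += e_y * p_l
    return total


def ip_weighted_mean(rows, a, a_prime):
    """Weighted mean of Y among those with A=a, weights Pr[A=a'|L] / Pr[A=a|L]."""
    def p_treatment(value, l):
        stratum = [r for r in rows if r[0] == l]
        return sum(1 for r in stratum if r[1] == value) / len(stratum)

    numerator = denominator = 0.0
    for l, ai, y in rows:
        if ai == a:
            weight = p_treatment(a_prime, l) / p_treatment(a, l)
            numerator += weight * y
            denominator += weight
    return numerator / denominator


observed_risk_treated = mean(y for (_, a, y) in rows if a == 1)  # Pr[Y=1 | A=1]
std = standardized_mean(rows, a=0, a_prime=1)   # Pr[Y^{a=0}=1 | A=1]
ipw = ip_weighted_mean(rows, a=0, a_prime=1)    # same quantity via IP weighting
print(observed_risk_treated, std, ipw, observed_risk_treated / std)
```

Under partial exchangeability Y^{a=0} ⊥⊥ A|L, the last printed value is the causal risk ratio in the treated for this stratum; the two estimators agree because both are nonparametric here.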





Some refer to lack of transportability as lack of external validity. 

A setting in which transportability may not be an issue: Smith and Pell (2003) could not identify any major modifiers of the effect of parachute use on death after “gravitational challenge” (e.g., jumping from an airplane at high altitude). They concluded that conducting randomized trials of parachute use restricted to a particular group of people would not compromise the transportability of the findings to other groups. 


The extrapolation of causal effects computed in one population to a second 
population is referred to as transportability of causal inferences across popula- 
tions (see Fine Point 4.2). In our example, the causal effect of heart transplant 
A on risk of death Y differs between men and women, and between Romans 
and Greeks. Thus the average causal effect in this population may not be trans- 
portable to other populations with a different distribution of effect modifiers 
such as sex and nationality. 

Conditional causal effects in the strata defined by the effect modifiers may 
be more transportable than the causal effect in the entire population, but 
there is no guarantee that the conditional effect measures in one population 
equal the conditional effect measures in another population. This is so be- 
cause there could be other unmeasured, or unknown, causal effect modifiers 
whose conditional distributions vary between the two populations (or for other 
reasons described in Fine Point 4.2). These unmeasured effect modifiers are 
not variables needed to achieve exchangeability, but just risk factors for the 
outcome. Therefore, transportability of effects across populations is a more 
difficult problem than the identification of causal effects in a single population: 
one would need to stratify not just on all those things required to achieve ex- 
changeability (which you might have information about, say, by interviewing 
those who decide how to allocate the treatment) but on unmeasured causes of 
the outcome for which there is much less information. 

Hence, transportability of causal effects is an unverifiable assumption that 
relies heavily on subject-matter knowledge. For example, most experts would 
agree that the health effects (on either the additive or multiplicative scale) of 
increasing a household’s annual income by $100 in Niger cannot be transported 
to the Netherlands, but most experts would agree that the health effects of use 
of cholesterol-lowering drugs in Europeans can be transported to Canadians. 


Several authors (e.g., Blot and Day, 1979; Rothman et al., 1980; Saracci, 1980) have referred to additive effect modification as the one of interest for public health purposes. 

Second, evaluating the presence of effect modification is helpful to identify 
the groups of individuals that would benefit most from an intervention. In our 
example of Table 4.1, the average causal effect of treatment A on outcome Y 
was null. However, treatment A had a beneficial effect in men (V = 0), and a 
harmful effect in women (V = 1). If a physician knew that there is qualitative 
effect modification by sex then, in the absence of additional information, she 
would treat the next patient only if he happens to be a man. The situation is 
slightly more complicated when, as in our second example, there is multiplicative, 
but not additive, effect modification. Here treatment changes the risk of 
the outcome by 0.1 (10 percentage points) in individuals with V = 0 and also by 0.1 in individuals 
with V = 1, i.e., there is no additive effect modification by V because the 
causal risk difference is 0.1 in all levels of V. Thus, an intervention on the treatment of all 
patients would change the risk by the same absolute amount in both strata of V, despite 
the fact that there is multiplicative effect modification. In fact, if there is a 
nonzero causal effect in at least one stratum of V and the counterfactual risk 
Pr[Y^{a=0} = 1|V = v] varies with v, then effect modification is guaranteed on 
either the additive or the multiplicative scale. 

Additive, but not multiplicative, effect modification is the appropriate scale 
to identify the groups that will benefit most from intervention. In the absence 
of additive effect modification, it is usually not very helpful to learn that there 
is multiplicative effect modification. 

In our second example, the presence of multiplicative effect modification 
follows from the mathematical fact that, because the risk under no treatment 
in the stratum V = 1 equals 0.8, the maximum possible causal risk ratio in the 
V = 1 stratum is 1/0.8 = 1.25. Thus the causal risk ratio in the stratum V = 1 
is guaranteed to differ from the causal risk ratio of 2 in the V = 0 stratum. In 
these situations, the presence of multiplicative effect modification is simply the 
consequence of different risk under no treatment Pr[Y^{a=0} = 1|V = v] across 
levels of V. Therefore, as a general rule, it is more informative to report the 
(absolute) counterfactual risks Pr[Y^{a=1} = 1|V = v] and Pr[Y^{a=0} = 1|V = v] 
in every level v of V, rather than simply their ratio or difference. 
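
The bound used in the second example follows from the fact that a risk cannot exceed 1. As a brief worked step, written with the second study's risks and offered only as a restatement of the argument above:

```latex
\frac{\Pr[Y^{a=1}=1 \mid V=1]}{\Pr[Y^{a=0}=1 \mid V=1]}
\;\le\; \frac{1}{0.8} \;=\; 1.25 \;<\; 2
\;=\; \frac{0.2}{0.1}
\;=\; \frac{\Pr[Y^{a=1}=1 \mid V=0]}{\Pr[Y^{a=0}=1 \mid V=0]}
```

Whatever the risk under treatment in the stratum V = 1, the two stratum-specific risk ratios cannot coincide, so multiplicative effect modification here reflects only the different risks under no treatment.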

Finally, the identification of effect modification may help understand the 
biological, social, or other mechanisms leading to the outcome. For example, 
a greater risk of HIV infection in uncircumcised compared with circumcised 
men may provide new clues to understand the disease. The identification of 
effect modification may be a first step towards characterizing the interactions 
between two treatments. The terms “effect modification” and “interaction” 
are sometimes used as synonymous in the scientific literature. This chapter 
focused on “effect modification.” The next chapter describes “interaction” as 
a causal concept that is related to, but different from, effect modification. 


4.4 Stratification as a form of adjustment 


Until this chapter, our only goal was to compute the average causal effect in 
the entire population. In the absence of marginal randomization, achieving 
this goal requires adjustment for the variables L that ensure conditional ex- 
changeability of the treated and the untreated. For example, in Chapter 2 we 
determined that the average causal effect of heart transplant A on mortality Y 
was null, that is, the causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1] = 1. We 
used the data in Table 2.2 to adjust for the factor L via both standardization 
and IP weighting. 

The present chapter adds another potential goal to the analysis: to identify 
effect modification by variables V. To achieve this goal, we need to stratify 
by V before adjusting for L. For example, in this chapter we stratified by 
nationality V before adjusting for L to determine that the average causal effect 
of heart transplant A on mortality Y differed between Greeks and Romans. 
In summary, standardization (or IP weighting) is used to adjust for L and 
stratification is used to identify effect modification by V. 

Fine Point 4.2 

Transportability. Causal effects estimated in one population are often intended to make decisions in another population, 
which we will refer to as the target population. Suppose we have correctly estimated the average causal effect of 
treatment in our study population under exchangeability, positivity, and consistency. Will the effect be the same in the 
target population? That is, can we “transport” the effect from the study population to the target population? The 
answer to this question depends on the characteristics of both populations. Specifically, transportability of effects from 
one population to another may be justified if the following characteristics are similar between the two populations: 

• Effect modification: The causal effect of treatment may differ across individuals with different susceptibility to 
the outcome. For example, if women are more susceptible to the effects of treatment than men, we say that sex 
is an effect modifier. The distribution of effect modifiers in a population will generally affect the magnitude of 
the causal effect of treatment in that population. If the distribution of effect modifiers differs between the study 
population and the target population, then the magnitude of the causal effect of treatment will differ too. 

• Versions of treatment: The causal effect of treatment depends on the distribution of versions of treatment in the 
population. If this distribution differs between the study population and the target population, then the magnitude 
of the causal effect of treatment will differ too. 

• Interference: In the main text we have focused on settings with no interference (Fine Point 1.1). However, one 
must remember that interference may exist because treating one individual may affect the outcome of others in 
the population. For example, a socially active individual may convince his friends to join him while exercising, and 
thus an intervention on that individual's physical activity may be more effective than an intervention on a socially 
isolated individual. Therefore, the patterns of contacts among individuals may affect the magnitude of the causal 
effect. If the contact patterns differ between the study population and the target population, then the magnitude 
of the causal effect of treatment will differ too. 

The transportability of causal inferences across populations may sometimes be improved by restricting our attention 
to the average causal effects in the strata defined by the effect modifiers, or by using the stratum-specific effects in 
the study population to reconstruct the average causal effect in the target population. For example, the four stratum-specific 
effect measures (Roman women, Greek women, Roman men, and Greek men) in our population can be combined 
in a weighted average to reconstruct the average causal effect in another population with a different mix of sex and 
nationality. The weight assigned to each stratum-specific measure is the proportion of individuals in that stratum in the 
second population. However, there is no guarantee that this reconstructed effect will coincide with the true effect in the 
target population because of possible between-population differences in the distribution of unmeasured effect modifiers, 
interference patterns, and distribution of versions of treatment. 
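
The weighted-average reconstruction described in Fine Point 4.2 can be sketched as follows. This is a simplified illustration under stated assumptions: it uses only the two nationality-specific causal risk differences reported in this chapter (0 in Greeks, 0.3 in Romans) rather than the four sex-by-nationality strata, the target population with 80% Romans is hypothetical, and the function name is ours.

```python
def transported_risk_difference(stratum_rd, target_shares):
    """Weighted average of stratum-specific risk differences.

    stratum_rd: causal risk difference in each stratum of the study population.
    target_shares: proportion of the target population in each stratum.
    """
    if set(stratum_rd) != set(target_shares):
        raise ValueError("strata must match between effects and shares")
    if abs(sum(target_shares.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1")
    return sum(stratum_rd[s] * target_shares[s] for s in stratum_rd)


# Stratum-specific causal risk differences from this chapter.
stratum_rd = {"Greek": 0.0, "Roman": 0.3}

# Study population: 20 Greeks and 20 Romans.
study = transported_risk_difference(stratum_rd, {"Greek": 0.5, "Roman": 0.5})

# Hypothetical target population with 80% Romans (an assumption for illustration).
target = transported_risk_difference(stratum_rd, {"Greek": 0.2, "Roman": 0.8})

print(study, target)
```

With equal shares the function recovers the risk difference of 0.15 reported for the study population. The same weighting applies to stratum-specific risks; a transported risk ratio would be formed from the weighted risks rather than by averaging stratum-specific ratios. As the Fine Point stresses, the result is only as good as the assumption that no unmeasured effect modifiers, versions of treatment, or interference patterns differ between populations.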


But stratification is not always used to identify effect modification by V. 
In practice stratification is often used as an alternative to standardization (and 
IP weighting) to adjust for L. In fact, the use of stratification as a method 
to adjust for L is so widespread that many investigators consider the terms 
“stratification” and “adjustment” as synonymous. For example, suppose you 
ask an epidemiologist to adjust for the factor L to compute the effect of heart 
transplant A on mortality Y. Chances are that she will immediately split 
Table 2.2 into two subtables (one restricted to individuals with L = 0, the 
other to individuals with L = 1) and would provide the effect measure (say, 
the risk ratio) in each of them. That is, she would calculate the risk ratios 
Pr[Y = 1|A = 1, L = l] / Pr[Y = 1|A = 0, L = l] = 1 for both l = 0 and l = 1. 

Under conditional exchangeability given L, the risk ratio in the subset L = l measures the average causal effect in the subset L = l because, if Y^a ⊥⊥ A|L, then Pr[Y = 1|A = a, L = 0] = Pr[Y^a = 1|L = 0]. 

Robins (1986, 1987) described the conditions under which stratum-specific effect measures for time-varying treatments will not have a causal interpretation even in the presence of exchangeability, positivity, and well-defined interventions. 

Stratification requires positivity in addition to exchangeability: the causal effect cannot be computed in subsets L = l in which there are only treated, or untreated, individuals. 

These two stratum-specific associational risk ratios can be endowed with a 
causal interpretation under conditional exchangeability given L: they measure 
the average causal effect in the subsets of the population defined by L = 0 
and L = 1, respectively. They are conditional effect measures. In contrast 
the risk ratio of 1 that we computed in Chapter 2 was a marginal (uncondi- 
tional) effect measure. In this particular example, all three risk ratios—the 
two conditional ones and the marginal one—happen to be equal because there 
is no effect modification by L. Stratification necessarily results in multiple 
stratum-specific effect measures (one per stratum defined by the variables L). 
Each of them quantifies the average causal effect in a nonoverlapping subset 
of the population but, in general, none of them quantifies the average causal 
effect in the entire population. Therefore, we did not consider stratification 
when describing methods to compute the average causal effect of treatment in 
the population in Chapter 2. Rather, we focused on standardization and IP 
weighting. 
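
To make the contrast concrete, the sketch below applies stratification by L to the Roman stratum and returns one conditional risk ratio per level of L. It is illustrative only: the cell counts are our reading of Table 4.2, the function names are assumptions, and the guard that returns None mirrors the requirement that both treated and untreated individuals exist in each stratum.

```python
# (L, A, Y, number of individuals) for the 20 Romans, as we read Table 4.2.
cells = [
    (0, 0, 1, 1), (0, 0, 0, 3),
    (0, 1, 1, 2), (0, 1, 0, 2),
    (1, 0, 1, 1), (1, 0, 0, 2),
    (1, 1, 1, 6), (1, 1, 0, 3),
]


def risk(l, a):
    """Pr[Y = 1 | A = a, L = l], or None if nobody has A = a in stratum l."""
    deaths = sum(n for (li, ai, y, n) in cells if (li, ai) == (l, a) and y == 1)
    total = sum(n for (li, ai, y, n) in cells if (li, ai) == (l, a))
    return deaths / total if total else None


def stratum_risk_ratio(l):
    """Associational risk ratio within L = l.

    Returns None when a treatment group is empty (positivity fails)
    or when the risk in the untreated is 0 (ratio undefined).
    """
    r1, r0 = risk(l, 1), risk(l, 0)
    if r1 is None or r0 is None or r0 == 0:
        return None
    return r1 / r0


conditional_rr = {l: stratum_risk_ratio(l) for l in (0, 1)}
print(conditional_rr)
```

Each value is a conditional effect measure for one level of L among the Romans, valid as a causal quantity only under conditional exchangeability given L. That the two conditional ratios happen to coincide in these data is a feature of the example, not a general property of stratification.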

In addition, unlike standardization and IP weighting, adjustment via strat- 
ification requires computing the effect measures in subsets of the population 
defined by a combination of all variables L that are required for conditional 
exchangeability. For example, when using stratification to estimate the effect 
of heart transplant in the population of Tables 2.2 and 4.2, one must compute 
the effect in Romans with L = 1, in Greeks with L = 1, in Romans with L = 0, 
and in Greeks with L = 0; but one cannot compute the effect in Romans by 
simply computing the association in the stratum V = 0 because nationality V, 
by itself, is insufficient to guarantee conditional exchangeability. 

That is, the use of stratification forces one to evaluate effect modification 
by all variables L required to achieve conditional exchangeability, regardless of 
whether one is interested in such effect modification. In contrast, stratification 
by V followed by IP weighting or standardization to adjust for L allows one 
to deal with exchangeability and effect modification separately, as described 
above. 

Other problems associated with the use of stratification are noncollapsibility 
of certain effect measures like the odds ratio (see Fine Point 4.3) and 
inappropriate adjustment that leads to bias when, as in the case of time-varying 
treatments, it is necessary to adjust for time-varying variables L that are 
affected by prior treatment (see Part III). 

Sometimes investigators compute the causal effect in only some of the strata 
defined by the variables L. That is, no stratum-specific effect measure is com- 
puted for some strata. This form of stratification is known as restriction. 
For causal inference, stratification is simply the application of restriction to 
several comprehensive and mutually exclusive subsets of the population, with 
exchangeability within each of these subsets. When positivity fails in some 
strata of the population, restriction is used to limit causal inference to those 
strata of the original population in which positivity holds (see Chapter 3). 








4.5 Matching as another form of adjustment 


Matching is another adjustment method. The goal of matching is to construct a 
subset of the population in which the variables L have the same distribution in 
both the treated and the untreated. As an example, take our heart transplant 
example in Table 2.2 in which the variable L is sufficient to achieve conditional 
exchangeability. For each untreated individual in noncritical condition (A = 
0, L = 0) randomly select a treated individual in noncritical condition (A = 
1, L = 0), and for each untreated individual in critical condition (A = 0, L = 1) 
randomly select a treated individual in critical condition (A = 1, L = 1). We 
refer to each untreated individual and her corresponding treated individual as a 
matched pair, and to the variable L as the matching factor. Suppose we formed 
the following 7 matched pairs: Rheia-Hestia, Kronos-Poseidon, Demeter-Hera, 
Hades-Zeus for L = 0, and Artemis-Ares, Apollo-Aphrodite, Leto-Hermes for 
L = 1. All the untreated, but only a sample of treated, in the population 
were selected. In this subset of the population made up of matched pairs, the 
proportion of individuals in critical condition (L = 1) is the same, by design, 
in the treated and in the untreated (3/7). 

Our discussion on matching applies to cohort studies only. In case-control designs (briefly discussed in Chapter 8), we often match cases and non-cases (i.e., controls) rather than the treated and the untreated. Even if the matching factors suffice for conditional exchangeability, matching in cases and controls does not achieve unconditional exchangeability of the treated and the untreated in the matched population. Adjustment for the matching factors via stratification is required to estimate conditional (stratum-specific) effect measures. 

As the number of matching factors increases, so does the probability that no exact matches exist for an individual. There is a vast literature, beyond the scope of this book, on how to find approximate matches in those settings. 

To construct our matched population we replaced the treated in the pop- 
ulation by a subset of the treated in which the matching factor L had the 
same distribution as that in the untreated. Under the assumption of condi- 
tional exchangeability given L, the result of this procedure is (unconditional) 
exchangeability of the treated and the untreated in the matched population. 
Because the treated and the untreated are exchangeable in the matched popu- 
lation, their average outcomes can be directly compared: the risk in the treated 
is 3/7, the risk in the untreated is 3/7, and hence the causal risk ratio is 1. Note 
that matching ensures positivity in the matched population because strata with 
only treated, or untreated, individuals are excluded from the analysis. 

Often one chooses the group with fewer individuals (the untreated in our 
example) and uses the other group (the treated in our example) to find their 
matches. The chosen group defines the subpopulation on which the causal 
effect is being computed. In the previous paragraph we computed the effect in 
the untreated. In settings with fewer treated than untreated individuals across 
all strata of L, we generally compute the effect in the treated. Also, matching 
need not be one-to-one (matching pairs); it can also be one-to-many (matching 
sets). 

In many applications, L is a vector of several variables. Then, for each 
untreated individual in a given stratum defined by a combination of values of 
all the variables in L, we would have randomly selected one (or several) treated 
individual(s) from the same stratum. 

Matching can be used to create a matched population with any chosen 
distribution of L, not just the distribution in the treated or the untreated. The 
distribution of interest can be achieved by individual matching, as described 
above, or by frequency matching. An example of the latter is a study in which 
one randomly selects treated individuals in such a way that 70% of them have 
L = 1, and then repeats the same procedure for the untreated. 

Because the matched population is a subset of the original study population, 
the distribution of causal effect modifiers in the matched study population 
will generally differ from that in the original, unmatched study population, as 
discussed in the next section. 


4.6 Effect modification and adjustment methods 


Standardization, IP weighting, stratification/restriction, and matching are dif- 
ferent approaches to estimate average causal effects, but they estimate different 

Technical Point 4.2 


Pooling of stratum-specific effect measures. So far we have focused on the conceptual, non statistical, aspects of 
causal inference by assuming that we work with the entire population rather than with a sample from it. Thus we talk 
about computing causal effects rather than about (consistently) estimating them. In the real world, however, we can 
rarely compute causal effects in the population. We need to estimate them from samples, and thus obtaining reasonably 
narrow confidence intervals around our estimated effect measures is an important practical concern. 

When dealing with stratum-specific effect measures, one commonly used strategy to reduce the variability of the 
estimates is to combine all stratum-specific effect measures into one pooled stratum-specific effect measure. The idea is 
that, if the effect measure is the same in all strata (i.e., if there is no effect-measure modification), then the pooled effect 
measure will be a more precise estimate of the common effect measure. Several methods (e.g., Woolf, Mantel-Haenszel, 
maximum likelihood) yield a pooled estimate, sometimes by computing a weighted average of the stratum-specific effect 
measures with weights chosen to reduce the variability of the pooled estimate. Greenland and Rothman (2008) review 
some commonly used methods for stratified analysis. Pooled effect measures can also be computed using regression 
models that include all possible product terms between all covariates L, but no product terms between treatment A 
and covariates L, i.e., models saturated (see Chapter 11) with respect to L. 

The main goal of pooling is to obtain a narrower confidence interval around the common stratum-specific effect 
measure, but the pooled effect measure is still a conditional effect measure. In our heart transplant example, the pooled 
stratum-specific risk ratio (Mantel-Haenszel method) was 0.88 for the outcome Z. This result is only meaningful if 
the stratum-specific risk ratios 2 and 0.5 are indeed estimates of the same stratum-specific causal effect. For example, 
suppose that the causal risk ratio is 0.9 in both strata but, because of the small sample size, we obtained estimates of 0.5 
and 2.0. In that case, pooling would be appropriate and the Mantel-Haenszel risk ratio would be closer to the truth than 
either of the stratum-specific risk ratios. Otherwise, if the causal stratum-specific risk ratios are truly 0.5 and 2.0, then 
pooling makes little sense and the Mantel-Haenszel risk ratio could not be easily interpreted. In practice, it is not always 
easy to determine whether the heterogeneity of the effect measure across strata is due to sampling variability or to 
effect-measure modification. The finer the stratification, the greater the uncertainty introduced by random variability. 
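
As a rough check of the pooled value quoted in this Technical Point, the following Python sketch applies the usual Mantel-Haenszel risk ratio formula for follow-up data to the stratum counts for outcome Z. The counts are an assumption of the sketch (they are our reading of Table 4.3), and the function names are introduced here only for illustration.

```python
from fractions import Fraction

# Per stratum of L: (treated with Z=1, treated total,
#                    untreated with Z=1, untreated total).
# Counts as read from Table 4.3 (an assumption of this sketch).
STRATA = {
    0: (2, 4, 1, 4),  # noncritical condition
    1: (3, 9, 2, 3),  # critical condition
}


def stratum_risk_ratios(strata):
    """Risk ratio within each stratum of L."""
    return {l: Fraction(a, n1) / Fraction(b, n0)
            for l, (a, n1, b, n0) in strata.items()}


def mantel_haenszel_risk_ratio(strata):
    """Pooled risk ratio: sum(a * n0 / n) / sum(b * n1 / n) over strata."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for a, n1, b, n0 in strata.values():
        n = n1 + n0
        numerator += Fraction(a * n0, n)
        denominator += Fraction(b * n1, n)
    return numerator / denominator


print(stratum_risk_ratios(STRATA))                 # stratum-specific: 2 and 1/2
print(float(mantel_haenszel_risk_ratio(STRATA)))   # 0.875
```

The pooled value 0.875 rounds to the 0.88 reported above. It sits between the two stratum-specific risk ratios, but it is a summary of two very different numbers, which is exactly the interpretive problem the text describes.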





types of causal effects. These four approaches can be divided into two groups 
according to the type of effect they estimate: standardization and IP weighting 
can be used to compute either marginal or conditional effects, 
stratification/restriction and matching can only be used to compute conditional 
effects in certain subsets of the population. All four approaches require 
exchangeability and positivity but the subsets of the population in which these 
conditions need to hold depend on the causal effect of interest. For example, to 
compute the conditional effect among individuals with L = l, any of the above 
methods requires exchangeability and positivity in that subset only; to estimate 
the marginal effect in the entire population, exchangeability and positivity are 
required in all levels of L. 

In the absence of effect modification, the effect measures (risk ratio or risk 
difference) computed via these four approaches will be equal. For example, 
we concluded that the average causal effect of heart transplant A on mortality 
Y was null both in the entire population of Table 2.2 (standardization and IP 
weighting), in the subsets of the population in critical condition L = 1 and non 
critical condition L = 0 (stratification), and in the untreated (matching). All 
methods resulted in a causal risk ratio equal to 1. However, the effect measures 
computed via these four approaches will not generally be equal. To illustrate 
how the effects may vary, let us compute the effect of heart transplant A on 
high blood pressure Z (1: yes, 0 otherwise) using the data in Table 4.3. We 
assume that exchangeability $Z^a \perp\!\!\!\perp A \mid L$ and positivity hold. We use the risk 
ratio scale for no particular reason. 

Table 4.3 
             L  A  Z 
Rheia        0  0  0 
Kronos       0  0  1 
Demeter      0  0  0 
Hades        0  0  0 
Hestia       0  1  0 
Poseidon     0  1  0 
Hera         0  1  1 
Zeus         0  1  1 
Artemis      1  0  1 
Apollo       1  0  1 
Leto         1  0  0 
Ares         1  1  1 
Athena       1  1  1 
Hephaestus   1  1  1 
Aphrodite    1  1  0 
Cyclope      1  1  0 
Persephone   1  1  0 
Hermes       1  1  0 
Hebe         1  1  0 
Dionysus     1  1  0 

Technical Point 4.3 


Relation between marginal and conditional risk ratios. Suppose we wish to determine under which 
conditions the marginal risk ratio $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$ will be less than 1 given that we know the 
values of the conditional risk ratios $\Pr[Y^{a=1} = 1 \mid L = l] / \Pr[Y^{a=0} = 1 \mid L = l]$ for each stratum $l$. To do so, 
note that 

$$\frac{\Pr[Y^{a=1} = 1]}{\Pr[Y^{a=0} = 1]} = \sum_l \frac{\Pr[Y^{a=1} = 1 \mid L = l]}{\Pr[Y^{a=0} = 1 \mid L = l]} \, w(l)$$

with $w(l) = \{\Pr[Y^{a=0} = 1 \mid L = l] \Pr[L = l]\} / \Pr[Y^{a=0} = 1]$ and $\sum_l w(l) = 1$. Substituting for $w(1)$ and $w(0)$ 
followed by some algebraic manipulations will provide the condition under which the inequality 
$\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1] < 1$ holds. 

In our data example, $\Pr[Y^{a=1} = 1 \mid L = l] / \Pr[Y^{a=0} = 1 \mid L = l]$ is 0.5 for L = 1 and 2.0 for L = 0. 
Therefore the marginal risk ratio will be less than 1 if and only if 
$\Pr[Y^{a=0} = 1 \mid L = 1] / \Pr[Y^{a=0} = 1 \mid L = 0] > 2 \Pr[L = 0] / \Pr[L = 1]$. 
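
The "algebraic manipulations" mentioned in this Technical Point can be spelled out as follows. This is a sketch of one possible route, using only the weights $w(l)$ defined above and the conditional risk ratios 0.5 (L = 1) and 2.0 (L = 0); the numerical check at the end assumes the stratum risks and the distribution of L as we read them from Table 4.3, with Z as the outcome.

```latex
% Marginal risk ratio as a weighted average, with w(0) = 1 - w(1):
\frac{\Pr[Y^{a=1}=1]}{\Pr[Y^{a=0}=1]}
  = 0.5\,w(1) + 2\,w(0)
  = 2 - 1.5\,w(1).

% Hence the marginal risk ratio is below 1 exactly when
2 - 1.5\,w(1) < 1
  \iff w(1) > \tfrac{2}{3}
  \iff \frac{w(1)}{w(0)} > 2 .

% Substituting the definition of w(l) (the common denominator cancels):
\frac{\Pr[Y^{a=0}=1 \mid L=1]\,\Pr[L=1]}{\Pr[Y^{a=0}=1 \mid L=0]\,\Pr[L=0]} > 2
  \iff
\frac{\Pr[Y^{a=0}=1 \mid L=1]}{\Pr[Y^{a=0}=1 \mid L=0]}
  > \frac{2\,\Pr[L=0]}{\Pr[L=1]} .

% Numerical check (assumed reading of Table 4.3, outcome Z):
% risks under no treatment 2/3 (L=1) and 1/4 (L=0); Pr[L=1] = 12/20, Pr[L=0] = 8/20.
\frac{2/3}{1/4} = \tfrac{8}{3} \approx 2.67
  \;>\; \frac{2 \times 8/20}{12/20} = \tfrac{4}{3} \approx 1.33,
\qquad
w(1) = \frac{(2/3)(12/20)}{1/2} = 0.8,
\qquad
0.5 \times 0.8 + 2 \times 0.2 = 0.8 .
```

The last line recovers the marginal risk ratio of 0.8 from the two conditional risk ratios, which is the sense in which the marginal risk ratio is a weighted average of the stratum-specific ones.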


Standardization and IP weighting yield the average causal effect in the 
entire population $\Pr[Z^{a=1} = 1] / \Pr[Z^{a=0} = 1] = 0.8$ (these and the following 
calculations are left to the reader). Stratification yields the conditional causal 
risk ratios $\Pr[Z^{a=1} = 1 \mid L = 0] / \Pr[Z^{a=0} = 1 \mid L = 0] = 2.0$ in the stratum L = 0, 
and $\Pr[Z^{a=1} = 1 \mid L = 1] / \Pr[Z^{a=0} = 1 \mid L = 1] = 0.5$ in the stratum L = 1. 
Matching, using the matched pairs selected in the previous section, yields the 
causal risk ratio in the untreated $\Pr[Z^{a=1} = 1 \mid A = 0] / \Pr[Z = 1 \mid A = 0] = 1.0$. 

We have computed four causal risk ratios and have obtained four different 
numbers: 0.8, 2.0, 0.5, and 1.0. All of them are correct. Leaving aside 
random variability (see Technical Point 4.2), the explanation of the differences 
is qualitative effect modification: Treatment doubles the risk among individuals 
in noncritical condition (L = 0, causal risk ratio 2.0) and halves the risk 
among individuals in critical condition (L = 1, causal risk ratio 0.5). The 
average causal effect in the population (causal risk ratio 0.8) is beneficial because 
the ratio $\Pr[Z^{a=0} = 1 \mid L = 1] / \Pr[Z^{a=0} = 1 \mid L = 0]$ of the counterfactual risk 
under no treatment in the critical group to that in the noncritical group 
exceeds 2 times the odds $\Pr[L = 0] / \Pr[L = 1]$ of being in the noncritical group 
(see Technical Point 4.3). The causal effect in the untreated is null (causal 
risk ratio 1.0), which reflects the larger proportion of individuals in noncritical 
condition in the untreated compared with the entire population. This example 
highlights the primary importance of specifying the population, or the subset 
of a population, to which the effect measure corresponds. 

Table 4.4 
             V  A  Y 
Rheia        1  0  0 
Demeter      1  0  0 
Hestia       1  0  0 
Hera         1  0  0 
Artemis      1  0  1 
Leto         1  1  0 
Athena       1  1  1 
Aphrodite    1  1  1 
Persephone   1  1  0 
Hebe         1  1  1 
Kronos       0  0  0 
Hades        0  0  0 
Poseidon     0  0  1 
Zeus         0  0  1 
Apollo       0  0  0 
Ares         0  1  1 
Hephaestus   0  1  1 
Cyclope      0  1  1 
Hermes       0  1  0 
Dionysus     0  1  1 

The previous chapter argued that a well-defined causal effect is a prerequisite 
for meaningful causal inference. This chapter argues that a well characterized 
target population is another such prerequisite. Both prerequisites are 
automatically present in experiments that compare two or more interventions 
in a population that meets certain a priori eligibility criteria. However, these 
prerequisites cannot be taken for granted in observational studies. Rather, 
investigators conducting observational studies need to explicitly define the causal 
effect of interest and the subset of the population in which the effect is being 
computed. Otherwise, misunderstandings might easily arise when effect measures 
obtained via different methods are different. 

Part II describes how standardization, IP weighting, and stratification can be 
used in combination with parametric or semiparametric models. For example, 
standard regression models are a form of stratification in which the association 
between treatment and outcome is estimated within levels of all the other 
covariates in the model. 

In our example above, one investigator who used IP weighting (and computed 
the effect in the entire population) and another one who used matching 
(and computed the effect in the untreated) need not engage in a debate about 
the superiority of one analytic approach over the other. Their discrepant effect 
measures result from the different causal question asked by each investigator 
rather than from their choice of analytic approach. In fact, the second 
investigator could have used IP weighting to compute the effect in the untreated or 
in the treated (see Technical Point 4.1). 

A final note. Stratification can be used to compute average causal effects 
in subsets of the population, but not individual (subject-specific) effects. As 
we have discussed earlier, individual causal effects can only be identified under 
extreme assumptions. See Fine Points 2.1 and 3.2. 
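
The four causal risk ratios discussed in this section (0.8, 2.0, 0.5, and 1.0) can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the (L, A, Z) values are our reading of Table 4.3, the matched treated individuals are those of the seven pairs listed in the matching section, and all function names are hypothetical helpers introduced here. Exchangeability given L and positivity are assumed, as in the text, so that stratum-specific observed risks stand in for counterfactual risks.

```python
from fractions import Fraction

# (name, L, A, Z) as read from Table 4.3 (an assumption of this sketch).
DATA = [
    ("Rheia", 0, 0, 0), ("Kronos", 0, 0, 1), ("Demeter", 0, 0, 0), ("Hades", 0, 0, 0),
    ("Hestia", 0, 1, 0), ("Poseidon", 0, 1, 0), ("Hera", 0, 1, 1), ("Zeus", 0, 1, 1),
    ("Artemis", 1, 0, 1), ("Apollo", 1, 0, 1), ("Leto", 1, 0, 0),
    ("Ares", 1, 1, 1), ("Athena", 1, 1, 1), ("Hephaestus", 1, 1, 1),
    ("Aphrodite", 1, 1, 0), ("Cyclope", 1, 1, 0), ("Persephone", 1, 1, 0),
    ("Hermes", 1, 1, 0), ("Hebe", 1, 1, 0), ("Dionysus", 1, 1, 0),
]
# Treated members of the seven matched pairs listed in the matching section.
MATCHED_TREATED = {"Hestia", "Poseidon", "Hera", "Zeus", "Ares", "Aphrodite", "Hermes"}


def rows_where(l=None, a=None):
    """Rows with the requested values of L and/or A (None means 'any')."""
    return [r for r in DATA
            if (l is None or r[1] == l) and (a is None or r[2] == a)]


def risk(rows):
    """Proportion of rows with Z = 1."""
    return Fraction(sum(r[3] for r in rows), len(rows))


def stratified_rr(l):
    """Conditional risk ratio in stratum L = l."""
    return risk(rows_where(l=l, a=1)) / risk(rows_where(l=l, a=0))


def standardized_risk(a):
    """Sum over l of Pr[Z=1 | A=a, L=l] * Pr[L=l]."""
    return sum(risk(rows_where(l=l, a=a)) * Fraction(len(rows_where(l=l)), len(DATA))
               for l in (0, 1))


def ip_weighted_risk(a):
    """Mean of I(A=a) * Z / Pr[A=a | L] over the whole population."""
    total = Fraction(0)
    for _, l, a_obs, z in DATA:
        if a_obs == a:
            pr_a_given_l = Fraction(len(rows_where(l=l, a=a)), len(rows_where(l=l)))
            total += z / pr_a_given_l
    return total / len(DATA)


matched_treated_rows = [r for r in DATA if r[0] in MATCHED_TREATED]
matching_rr = risk(matched_treated_rows) / risk(rows_where(a=0))

print(standardized_risk(1) / standardized_risk(0))  # 4/5
print(ip_weighted_risk(1) / ip_weighted_risk(0))    # 4/5
print(stratified_rr(0), stratified_rr(1))           # 2 1/2
print(matching_rr)                                  # 1
```

Standardization and IP weighting agree, as they must, because both target the effect in the entire population; the other two numbers answer questions about different subsets.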

Fine Point 4.3 


Collapsibility and the odds ratio. In the absence of multiplicative effect modification by V, the causal risk ratio in 
the entire population, $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$, is equal to the conditional causal risk ratios 
$\Pr[Y^{a=1} = 1 \mid V = v] / \Pr[Y^{a=0} = 1 \mid V = v]$ in every stratum v of V. More generally, the causal risk ratio is a weighted average of the 
stratum-specific risk ratios. For example, if the causal risk ratios in the strata V = 1 and V = 0 were equal to 2 and 3, 
respectively, then the causal risk ratio in the population would be greater than 2 and less than 3. That the value of the 
causal risk ratio (and the causal risk difference) in the population is always constrained by the range of values of the 
stratum-specific risk ratios is not only obvious but also a desirable characteristic of any effect measure. 

Now consider a hypothetical effect measure (other than the risk ratio or the risk difference) such that the population 
effect measure were not a weighted average of the stratum-specific measures. That is, the population effect measure 
would not necessarily lie inside of the range of values of the stratum-specific effect measures. Such effect measure would 
be an odd one. The odds ratio (pun intended) is such an effect measure, as we now discuss. 

Suppose the data in Table 4.4 were collected to compute the causal effect of altitude A on depression Y in a 
population of 20 individuals who were not depressed at baseline. The treatment A is 1 if the individual moved to a 
high altitude residence (on the top of Mount Olympus), 0 otherwise; the outcome Y is 1 if the individual subsequently 
developed depression, 0 otherwise; and V is 1 if the individual was female, 0 if male. The decision to move was random, 
i.e., those more prone to develop depression were as likely to move as the others; effectively $Y^a \perp\!\!\!\perp A$. Therefore the 
risk ratio $\Pr[Y = 1 \mid A = 1] / \Pr[Y = 1 \mid A = 0] = 2.3$ is the causal risk ratio in the population, and the odds ratio 

$$\frac{\Pr[Y = 1 \mid A = 1] / \Pr[Y = 0 \mid A = 1]}{\Pr[Y = 1 \mid A = 0] / \Pr[Y = 0 \mid A = 0]} = 5.4$$

is the causal odds ratio 

$$\frac{\Pr[Y^{a=1} = 1] / \Pr[Y^{a=1} = 0]}{\Pr[Y^{a=0} = 1] / \Pr[Y^{a=0} = 0]}$$

in the population. The risk ratio and the odds ratio measure the same causal effect on different scales. 

Let us now compute the sex-specific causal effects on the risk ratio and odds ratio scales. The (conditional) causal 
risk ratio $\Pr[Y = 1 \mid V = v, A = 1] / \Pr[Y = 1 \mid V = v, A = 0]$ is 2 for men (V = 0) and 3 for women (V = 1). 
The (conditional) causal odds ratio 

$$\frac{\Pr[Y = 1 \mid V = v, A = 1] / \Pr[Y = 0 \mid V = v, A = 1]}{\Pr[Y = 1 \mid V = v, A = 0] / \Pr[Y = 0 \mid V = v, A = 0]}$$

is 6 for men (V = 0) and 6 for 
women (V = 1). The causal risk ratio in the population, 2.3, is in between the sex-specific causal risk ratios 2 and 3. In 
contrast, the causal odds ratio in the population, 5.4, is smaller (i.e., closer to the null value) than both sex-specific odds 
ratios, 6. The causal effect, when measured on the odds ratio scale, is bigger in each half of the population than in the 
entire population. The population causal odds ratio can be closer to the null value than the non-null stratum-specific 
causal odds ratio when V is an independent risk factor for Y and, as in our randomized experiment, A is independent 
of V (Miettinen and Cook, 1981). 

We say that an effect measure is collapsible when the population effect measure can be expressed as a weighted 
average of the stratum-specific measures. In follow-up studies the risk ratio and the risk difference are collapsible effect 
measures, but the odds ratio (or the rarely used odds difference) is not (Greenland 1987). The noncollapsibility of the 
odds ratio, which is a special case of Jensen's inequality (Samuels 1981), may lead to counterintuitive findings like those 
described above. The odds ratio is collapsible under the sharp null hypothesis (both the conditional and unconditional 
effect measures are then equal to the null value), and it is approximately collapsible (and approximately equal to the 
risk ratio) when the outcome is rare (say, < 10%) in every stratum of a follow-up study. 

One important consequence of the noncollapsibility of the odds ratio is the logical impossibility of equating "lack of 
exchangeability" and "change in the conditional odds ratio compared with the unconditional odds ratio." In our example, 
the change in odds ratio was about 10% ($1 - 6/5.4$) even though the treated and the untreated were exchangeable. 
Greenland, Robins, and Pearl (1999) reviewed the relation between noncollapsibility and lack of exchangeability. 
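
The contrast between the collapsible risk ratio and the noncollapsible odds ratio in Fine Point 4.3 can be checked numerically. The sketch below assumes the stratum counts as we read them from Table 4.4 (they are consistent with the reported values 2, 3, 6, 6, 2.3, and 5.4); the helper names are hypothetical.

```python
from fractions import Fraction

# COUNTS[v][a] = (number with Y = 1, number of individuals)
# for sex V = v and treatment A = a, as read from Table 4.4
# (an assumption of this sketch).
COUNTS = {
    0: {1: (4, 5), 0: (2, 5)},  # men
    1: {1: (3, 5), 0: (1, 5)},  # women
}


def odds(p):
    return p / (1 - p)


def stratum_risk(v, a):
    cases, total = COUNTS[v][a]
    return Fraction(cases, total)


def population_risk(a):
    cases = sum(COUNTS[v][a][0] for v in COUNTS)
    total = sum(COUNTS[v][a][1] for v in COUNTS)
    return Fraction(cases, total)


stratum_rr = {v: stratum_risk(v, 1) / stratum_risk(v, 0) for v in COUNTS}
stratum_or = {v: odds(stratum_risk(v, 1)) / odds(stratum_risk(v, 0)) for v in COUNTS}
population_rr = population_risk(1) / population_risk(0)
population_or = odds(population_risk(1)) / odds(population_risk(0))

print(stratum_rr, population_rr)  # stratum RRs 2 and 3; population RR 7/3
print(stratum_or, population_or)  # stratum ORs 6 and 6; population OR 49/9
```

The population risk ratio 7/3 (about 2.3) lies between the stratum-specific values 2 and 3, whereas the population odds ratio 49/9 (about 5.4) lies outside the range of the stratum-specific odds ratios, both of which equal 6.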


Chapter 5 
INTERACTION 


Consider again a randomized experiment to answer the causal question “does one’s looking up at the sky make 
other pedestrians look up too?” We have so far restricted our interest to the causal effect of a single treatment 
(looking up) in either the entire population or a subset of it. However, many causal questions are actually about 
the effects of two or more simultaneous treatments. For example, suppose that, besides randomly assigning your 
looking up, we also randomly assign whether you stand in the street dressed or naked. We can now ask questions 
like: what is the causal effect of your looking up if you are dressed? And if you are naked? If these two causal 
effects differ we say that the two treatments under consideration (looking up and being dressed) interact in bringing 
about the outcome. 

When joint interventions on two or more treatments are feasible, the identification of interaction allows one 
to implement the most effective interventions. Thus understanding the concept of interaction is key for causal 
inference. This chapter provides a formal definition of interaction between two treatments, both within our 
already familiar counterfactual framework and within the sufficient-component-cause framework. 


5.1 Interaction requires a joint intervention 


Suppose that in our heart transplant example, individuals were assigned to 
receiving either a multivitamin complex (E = 1) or no vitamins (E = 0) 
before being assigned to either heart transplant (A = 1) or no heart transplant 
(A = 0). We can now classify all individuals into 4 treatment groups: 
vitamins-transplant (E = 1, A = 1), vitamins-no transplant (E = 1, A = 0), 
no vitamins-transplant (E = 0, A = 1), and no vitamins-no transplant (E = 0, 
A = 0). For each individual, we can now imagine 4 potential or counterfactual 
outcomes, one under each of these 4 treatment combinations: $Y^{a=1,e=1}$, 
$Y^{a=1,e=0}$, $Y^{a=0,e=1}$, and $Y^{a=0,e=0}$. In general, an individual's counterfactual 
outcome $Y^{a,e}$ is the outcome that would have been observed if we had 
intervened to set the individual's values of A and E to a and e, respectively. We 
refer to interventions on two or more treatments as joint interventions. 

The counterfactual $Y^a$ corresponding to an intervention on A alone is the joint 
counterfactual $Y^{a,e}$ if the observed E takes the value e, i.e., $Y^a = Y^{a,E}$. In 
fact, consistency is a special case of this recursive substitution. Specifically, 
the observed $Y = Y^A = Y^{A,E}$, which is our definition of consistency. See also 
Technical Point 6.2. 

We are now ready to provide a definition of interaction within the 
counterfactual framework. There is interaction between two treatments A and E 
if the causal effect of A on Y after a joint intervention that set E to 1 differs 
from the causal effect of A on Y after a joint intervention that set E to 0. For 
example, there would be an interaction between transplant A and vitamins E 
if the causal effect of transplant on survival had everybody taken vitamins 
were different from the causal effect of transplant on survival had nobody taken 
vitamins. 

When the causal effect is measured on the risk difference scale, we say that 
there is interaction between A and E on the additive scale in the population if 

$$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1] \neq \Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1].$$

For example, suppose the causal risk difference for transplant A when everybody 
receives vitamins, $\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1]$, were 0.1, and 
that the causal risk difference for transplant A when nobody receives vitamins, 
$\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]$, were 0.2. We say that there 
is interaction between A and E on the additive scale because the risk difference 
$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1]$ is less than the risk difference 
$\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]$. Using simple algebra, it can be easily 
shown that this inequality implies that the causal risk difference for vitamins E 
when everybody receives a transplant, $\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=1,e=0} = 1]$, 
is also less than the causal risk difference for vitamins E when nobody 
receives a transplant A, $\Pr[Y^{a=0,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1]$. That is, we can 
equivalently define interaction between A and E on the additive scale as 

$$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=1,e=0} = 1] \neq \Pr[Y^{a=0,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1].$$

The two inequalities displayed above show that treatments A and E have equal 
status in the definition of interaction. 

Let us now review the difference between interaction and effect modification. 
As described in the previous chapter, a variable V is a modifier of the 
effect of A on Y when the average causal effect of A on Y varies across levels of 
V. Note the concept of effect modification refers to the causal effect of A, not 
to the causal effect of V. For example, sex was an effect modifier for the effect 
of heart transplant in Table 4.1, but we never discussed the effect of sex on 
death. Thus, when we say that V modifies the effect of A we are not considering 
V and A as variables of equal status, because only A is considered to be 
a variable on which we could hypothetically intervene. That is, the definition 
of effect modification involves the counterfactual outcomes $Y^a$, not the 
counterfactual outcomes $Y^{a,v}$. In contrast, the definition of interaction between A 
and E gives equal status to both treatments A and E, as reflected by the two 
equivalent definitions of interaction shown above. The concept of interaction 
refers to the joint causal effect of two treatments A and E, and thus involves 
the counterfactual outcomes $Y^{a,e}$ under a joint intervention. 


5.2 Identifying interaction 


In previous chapters we have described the conditions that are required to 
identify the average causal effect of a treatment A on an outcome Y, either 
in the entire population or in a subset of it. The three key identifying condi- 
tions were exchangeability, positivity, and consistency. Because interaction is 
concerned with the joint effect of two (or more) treatments A and E, identi- 
fying interaction requires exchangeability, positivity, and consistency for both 
treatments. 

Suppose that vitamins E were randomly, and unconditionally, assigned by 
the investigators. Then positivity and consistency hold, and the treated E = 1 
and the untreated E = 0 are expected to be exchangeable. That is, the risk 
that would have been observed if all individuals had been assigned to transplant 
A = 1 and vitamins E = 1 equals the risk that would have been observed if 
all individuals who received E = 1 had been assigned to transplant A = 1. 
Formally, the marginal risk $\Pr[Y^{a=1,e=1} = 1]$ is equal to the conditional risk 
$\Pr[Y^{a=1} = 1 \mid E = 1]$. As a result, we can rewrite the definition of interaction 
between A and E on the additive scale as 

$$\Pr[Y^{a=1} = 1 \mid E = 1] - \Pr[Y^{a=0} = 1 \mid E = 1] \neq \Pr[Y^{a=1} = 1 \mid E = 0] - \Pr[Y^{a=0} = 1 \mid E = 0],$$

Interaction on the additive and multiplicative scales. The equality of causal risk differences 
$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1] = \Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]$ can be rewritten as 

$$\Pr[Y^{a=1,e=1} = 1] = \{\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]\} + \Pr[Y^{a=0,e=1} = 1].$$

By subtracting $\Pr[Y^{a=0,e=0} = 1]$ from both sides of the equation, we get 

$$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1] = \{\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]\} + \{\Pr[Y^{a=0,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1]\}.$$

This equality is another compact way to show that treatments A and E have equal status in the definition of interaction. 

When the above equality holds, we say that there is no interaction between A and E on the additive scale, and we 
say that the causal risk difference $\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1]$ is additive because it can be written as the 
sum of the causal risk differences that measure the effect of A in the absence of E and the effect of E in the absence of 
A. Conversely, there is interaction between A and E on the additive scale if 

$$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1] \neq \{\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]\} + \{\Pr[Y^{a=0,e=1} = 1] - \Pr[Y^{a=0,e=0} = 1]\}.$$

The interaction is superadditive if the 'not equal to' ($\neq$) symbol can be replaced by a 'greater than' (>) symbol. The 
interaction is subadditive if the 'not equal to' ($\neq$) symbol can be replaced by a 'less than' (<) symbol. 

Analogously, one can define interaction on the multiplicative scale when the effect measure is the causal risk ratio, 
rather than the causal risk difference. We say that there is interaction between A and E on the multiplicative scale if 

$$\frac{\Pr[Y^{a=1,e=1} = 1]}{\Pr[Y^{a=0,e=0} = 1]} \neq \frac{\Pr[Y^{a=1,e=0} = 1]}{\Pr[Y^{a=0,e=0} = 1]} \times \frac{\Pr[Y^{a=0,e=1} = 1]}{\Pr[Y^{a=0,e=0} = 1]}.$$

The interaction is supermultiplicative if the 'not equal to' ($\neq$) symbol can be replaced by a 'greater than' (>) symbol. 
The interaction is submultiplicative if the 'not equal to' ($\neq$) symbol can be replaced by a 'less than' (<) symbol. 


which is exactly the definition of modification of the effect of A by E on the 
additive scale. In other words, when treatment E is randomly assigned, then 
the concepts of interaction and effect modification coincide. The methods 
described in Chapter 4 to identify modification of the effect of A by V can now 
be applied to identify interaction of A and E by simply replacing the effect 
modifier V by the treatment E. 
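
The additive and multiplicative definitions of interaction can be made concrete with a small Python sketch. The four counterfactual risks below are hypothetical numbers, chosen only so that the risk differences for A equal the 0.1 and 0.2 used in the example of Section 5.1; the function names are likewise introduced here for illustration.

```python
from fractions import Fraction


def additive_contrast(p11, p01, p10, p00):
    """(Effect of A when e=1) minus (effect of A when e=0), risk difference scale.
    Arguments are Pr[Y^{a,e} = 1], named p<a><e>. Nonzero means additive interaction."""
    return (p11 - p01) - (p10 - p00)


def multiplicative_ratio(p11, p01, p10, p00):
    """Joint risk ratio divided by the product of the two single-treatment
    risk ratios. A value different from 1 means multiplicative interaction."""
    return (p11 / p00) / ((p10 / p00) * (p01 / p00))


# HYPOTHETICAL risks (not from the text), consistent with risk differences
# for A of 0.1 when e = 1 and 0.2 when e = 0.
p11, p01, p10, p00 = (Fraction(5, 10), Fraction(4, 10),
                      Fraction(5, 10), Fraction(3, 10))

add = additive_contrast(p11, p01, p10, p00)
mult = multiplicative_ratio(p11, p01, p10, p00)
print(add)   # -1/10: subadditive
print(mult)  # 3/4: submultiplicative

# Swapping the roles of A and E gives the same additive contrast,
# reflecting the equal status of the two treatments.
print((p11 - p10) - (p01 - p00))  # -1/10
```

Exact fractions are used instead of floating point so that "equal to zero" and "equal to one" can be tested without rounding error.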


Now suppose treatment E was not assigned by investigators. To assess the 
presence of interaction between A and E, one still needs to compute the four 
marginal risks $\Pr[Y^{a,e} = 1]$. In the absence of marginal randomization, these 
risks can be computed for both treatments A and E, under the usual identifying 
assumptions, by standardization or IP weighting conditional on the measured 
covariates. An equivalent way of conceptualizing this problem follows: rather 
than viewing A and E as two distinct treatments with two possible levels (1 
or 0) each, one can view AE as a combined treatment with four possible levels 
(11, 01, 10, 00). Under this conceptualization the identification of interaction 
between two treatments is not different from the identification of the causal 
effect of one treatment that we have discussed in previous chapters. The same 
methods, under the same identifiability conditions, can be used. The only 
difference is that now there is a longer list of values that the treatment of 
interest can take, and therefore a greater number of counterfactual outcomes. 
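
To illustrate the "combined treatment" view, the sketch below standardizes over L for each of the four levels of the joint treatment (A, E) and then forms the additive interaction contrast. The counts are entirely hypothetical (invented for this illustration, not taken from the book), and the sketch assumes that exchangeability, positivity, and consistency hold for the joint treatment given L.

```python
from fractions import Fraction

# HYPOTHETICAL counts, invented for illustration only:
# (L, A, E) -> (number with Y = 1, number of individuals)
COUNTS = {
    (0, 0, 0): (1, 4), (0, 1, 0): (2, 4), (0, 0, 1): (1, 4), (0, 1, 1): (2, 4),
    (1, 0, 0): (2, 4), (1, 1, 0): (2, 4), (1, 0, 1): (3, 4), (1, 1, 1): (1, 4),
}


def pr_l(l):
    """Marginal probability Pr[L = l] in the study population."""
    n_l = sum(n for (li, _, _), (_, n) in COUNTS.items() if li == l)
    n_all = sum(n for _, n in COUNTS.values())
    return Fraction(n_l, n_all)


def standardized_risk(a, e):
    """Sum over l of Pr[Y=1 | A=a, E=e, L=l] * Pr[L=l]."""
    total = Fraction(0)
    for l in sorted({key[0] for key in COUNTS}):
        cases, n = COUNTS[(l, a, e)]  # a missing cell would violate positivity
        total += Fraction(cases, n) * pr_l(l)
    return total


# One standardized risk per level of the combined treatment (a, e).
risks = {(a, e): standardized_risk(a, e) for a in (0, 1) for e in (0, 1)}
additive_contrast = (risks[1, 1] - risks[0, 1]) - (risks[1, 0] - risks[0, 0])
print(risks)
print(additive_contrast)  # -1/4 with these made-up counts
```

Nothing in the computation is specific to two treatments: the joint treatment is handled exactly like a single treatment with four levels, which is the point made in the text.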


Sometimes one may be willing to assume (conditional) exchangeability for 
treatment A but not for treatment E, e.g., when estimating the causal effect 
of A in subgroups defined by E in a randomized experiment. In that case, one 
cannot generally assess the presence of interaction between A and E, but can 
still assess the presence of effect modification by E. This is so because one 
does not need any identifying assumptions involving E to compute the effect 
of A in each of the strata defined by E. In the previous chapter we used the 
notation V (rather than E) for variables for which we are not willing to make 
assumptions about exchangeability, positivity, and consistency. For example, 
we concluded that the effect of transplant A was modified by nationality V, 
but we never required any identifying assumptions for the effect of V because 
we were not interested in using our data to compute the causal effect of V 
on Y. In Section 4.2 we argued on substantive grounds that V is a surrogate 
effect modifier; that is, V does not act on the outcome and therefore does not 
interact with A: no action, no interaction. But V is a modifier of the effect 
of A on Y because V is correlated with (e.g., it is a proxy for) an unidentified 
variable that actually has an effect on Y and interacts with A. Thus there 
can be modification of the effect of A by another variable without interaction 
between A and that variable. 

Interaction between A and E without modification of the effect of A by E is 
also logically possible, though probably rare, because it requires dual effects 
of A and exact cancellations (VanderWeele 2009). 

In the above paragraphs we have argued that a sufficient condition for 
identifying interaction between two treatments A and E is that exchangeability, 
positivity, and consistency are all satisfied for the joint treatment (A, E) with 
the four possible values (0,0), (0,1), (1,0), and (1,1). Then standardization 
or IP weighting can be used to estimate the joint effects of the two treatments 
and thus to evaluate interaction between them. In Part III, we show that this 
condition is not necessary when the two treatments occur at different times. 
For the remainder of Part I (except this chapter) and most of Part II, we will 
focus on the causal effect of a single treatment A. 

In Chapter 1 we described deterministic and nondeterministic counterfac- 
tual outcomes. Up to here, we used deterministic counterfactuals for simplicity. 
However, none of the results we have discussed for population causal effects 
and interactions require deterministic counterfactual outcomes. In contrast, 
the following section of this chapter only applies in the case that counterfactu- 
als are deterministic. Further, we also assume that treatments and outcomes 
are dichotomous. 


5.3 Counterfactual response types and interaction 





Table 5.1 
Type     $Y^{a=0}$  $Y^{a=1}$ 
Doomed   1          1 
Helped   1          0 
Hurt     0          1 
Immune   0          0 


Individuals can be classified in terms of their deterministic counterfactual re- 
sponses. For example, in Table 4.1 (same as Table 1.1), there are four types 
of people: the “doomed” who will develop the outcome regardless of what 
treatment they receive (Artemis, Athena, Persephone, Ares), the “immune” 
who will not develop the outcome regardless of what treatment they receive 
(Demeter, Hestia, Hera, Hades), the “helped” who will develop the outcome 
only if untreated (Hebe, Kronos, Poseidon, Apollo, Hermes, Dionysus), and the 
“hurt” who will develop the outcome only if treated (Rheia, Leto, Aphrodite, 
Zeus, Hephaestus, Cyclope). Each combination of counterfactual responses is 
often referred to as a response pattern or a response type. Table 5.1 displays 
the four possible response types. 
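
The classification in Table 5.1 amounts to a lookup on the pair of deterministic counterfactual outcomes. A minimal Python sketch follows; the names `RESPONSE_TYPES`, `response_type`, and `individual_effect` are introduced here for illustration only.

```python
# Keys are (outcome if untreated, outcome if treated), following Table 5.1.
RESPONSE_TYPES = {
    (1, 1): "doomed",  # develops the outcome regardless of treatment
    (1, 0): "helped",  # develops the outcome only if untreated
    (0, 1): "hurt",    # develops the outcome only if treated
    (0, 0): "immune",  # never develops the outcome
}


def response_type(y_untreated, y_treated):
    """Response type for a pair of deterministic counterfactual outcomes."""
    return RESPONSE_TYPES[(y_untreated, y_treated)]


def individual_effect(y_untreated, y_treated):
    """Individual causal effect on the difference scale: -1, 0, or +1."""
    return y_treated - y_untreated


print(response_type(1, 0), individual_effect(1, 0))  # helped -1
print(response_type(0, 0), individual_effect(0, 0))  # immune 0
```

Doomed and immune individuals have a null individual effect, so only the helped and the hurt contribute to a non-null average causal effect.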

When considering two dichotomous treatments A and E, there are 16 possible 
response types because each individual has four counterfactual outcomes, 
one under each of the four possible joint interventions on treatments A and 
E: (1,1), (0,1), (1,0), and (0,0). Table 5.2 shows the 16 response types for 
two treatments. This section explores the relation between response types and 
the presence of interaction in the case of two dichotomous treatments A and 
E and a dichotomous outcome Y. 

Table 5.2 
        $Y^{a,e}$ for each a,e value 
Type    1,1   0,1   1,0   0,0 
 1       1     1     1     1 
 2       1     1     1     0 
 3       1     1     0     1 
 4       1     1     0     0 
 5       1     0     1     1 
 6       1     0     1     0 
 7       1     0     0     1 
 8       1     0     0     0 
 9       0     1     1     1 
10       0     1     1     0 
11       0     1     0     1 
12       0     1     0     0 
13       0     0     1     1 
14       0     0     1     0 
15       0     0     0     1 
16       0     0     0     0 

Miettinen (1982) described the 16 possible response types under two 
binary treatments and outcome. 

Greenland and Poole (1988) noted that Miettinen's response types 
were not invariant to recoding of A and E (i.e., switching the labels 
"0" and "1"). They partitioned the 16 response types of Table 5.2 into 
these three equivalence classes that are invariant to recoding. 

The first type in Table 5.2 has the counterfactual outcome $Y^{a=1,e=1}$ equal 
to 1, which means that an individual of this type would die if treated with 
both transplant and vitamins. The other three counterfactual outcomes are 
also equal to 1, i.e., $Y^{a=1,e=1} = Y^{a=0,e=1} = Y^{a=1,e=0} = Y^{a=0,e=0} = 1$, which 
means that an individual of this type would also die if treated with (no 
transplant, vitamins), (transplant, no vitamins), or (no transplant, no vitamins). 
In other words, neither treatment A nor treatment E has any effect on the 
outcome of such an individual. He would die no matter what joint treatment he 
is assigned to. Now consider type 16. All the counterfactual outcomes are 0, 
i.e., $Y^{a=1,e=1} = Y^{a=0,e=1} = Y^{a=1,e=0} = Y^{a=0,e=0} = 0$. Again, neither 
treatment A nor treatment E has any effect on the outcome of an individual of this 
type. She would survive no matter what joint treatment she is assigned to. 
If all individuals in the population were of types 1 and 16, we would say that 
neither A nor E has any causal effect on Y; the sharp causal null hypothesis 
would be true for the joint treatment (A, E). 

Let us now focus our attention on types 4, 6, 11, and 13. Individuals of type 
4 would only die if treated with vitamins, whether they do or do not receive 
a transplant, i.e., $Y^{a=1,e=1} = Y^{a=0,e=1} = 1$ and $Y^{a=1,e=0} = Y^{a=0,e=0} = 0$. 
Individuals of type 13 would only die if not treated with vitamins, whether 
they do or do not receive a transplant, i.e., $Y^{a=1,e=1} = Y^{a=0,e=1} = 0$ and 
$Y^{a=1,e=0} = Y^{a=0,e=0} = 1$. Individuals of type 6 would only die if treated 
with transplant, whether they do or do not receive vitamins, i.e., 
$Y^{a=1,e=1} = Y^{a=1,e=0} = 1$ and $Y^{a=0,e=1} = Y^{a=0,e=0} = 0$. Individuals of type 11 
would only die if not treated with transplant, whether they do or do not receive 
vitamins, i.e., $Y^{a=1,e=1} = Y^{a=1,e=0} = 0$ and $Y^{a=0,e=1} = Y^{a=0,e=0} = 1$. 

Of the 16 possible response types in Table 5.2, we have identified 6 types 
(numbers 1, 4, 6, 11, 13, 16) with a common characteristic: for an individual 
with one of those response types, the causal effect of treatment A on the 
outcome Y is the same regardless of the value of treatment E, and the causal 
effect of treatment E on the outcome Y is the same regardless of the value of 
treatment A. In a population in which every individual has one of these 6 
response types, the causal effect of treatment A in the presence of treatment E, as 
measured by the causal risk difference $\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1]$, 
would equal the causal effect of treatment A in the absence of treatment E, as 
measured by the causal risk difference $\Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]$. 
That is, if all individuals in the population have response types 1, 4, 6, 11, 
13, and 16, then there will be no interaction between A and E on the additive 
scale. 
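
The six types singled out above can be recovered mechanically. The following is a minimal sketch, not part of the original text: it enumerates the 16 response types in the row order of Table 5.2 and keeps those whose individual-level additive contrast is zero. The names `response_types` and `interaction_contrast` are illustrative choices.

```python
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]


def response_types():
    """Return {type number: {(a, e): Y^{a,e}}} in the row order of Table 5.2."""
    types = {}
    # product((1, 0), repeat=4) yields (1,1,1,1), (1,1,1,0), ..., (0,0,0,0).
    for number, outcomes in enumerate(product((1, 0), repeat=4), start=1):
        types[number] = dict(zip(INTERVENTIONS, outcomes))
    return types


def interaction_contrast(y):
    """Effect of A when e=1 minus effect of A when e=0 for one individual.

    The same number is obtained by contrasting the effect of E across levels of A.
    """
    return (y[(1, 1)] - y[(0, 1)]) - (y[(1, 0)] - y[(0, 0)])


TYPES = response_types()
no_interaction = [k for k, y in TYPES.items() if interaction_contrast(y) == 0]
print(no_interaction)  # [1, 4, 6, 11, 13, 16]
```

Because the contrast is symmetric in A and E, a single check covers both halves of the statement in the text.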

The presence of additive interaction between A and E implies that, for some 
individuals in the population, the value of their two counterfactual outcomes 
under A = a cannot be determined without knowledge of the value of E, and 
vice versa. That is, there must be individuals in at least one of the following 
three classes: 

1. those who would develop the outcome under only one of the four treatment 
   combinations (types 8, 12, 14, and 15 in Table 5.2) 

2. those who would develop the outcome under two treatment combinations, 
   with the particularity that the effect of each treatment is exactly the 
   opposite under each level of the other treatment (types 7 and 10) 

3. those who would develop the outcome under three of the four treatment 
   combinations (types 2, 3, 5, and 9) 


Technical Point 5.2 

Monotonicity of causal effects. Consider a setting with a dichotomous treatment A and outcome Y. The value 
of the counterfactual outcome $Y^{a=0}$ is greater than that of $Y^{a=1}$ only among individuals of the “helped” type. For 
the other 3 types, $Y^{a=1} \geq Y^{a=0}$ or, equivalently, an individual's counterfactual outcomes are monotonically increasing 
(i.e., nondecreasing) in a. Thus, when the treatment cannot prevent any individual’s outcome (i.e., in the absence of 
“helped” individuals), all individuals’ counterfactual response types are monotonically increasing in a. We then simply 
say that the causal effect of A on Y is monotonic. 

The concept of monotonicity can be generalized to two treatments A and E. The causal effects of A and E 
on Y are monotonic if every individual's counterfactual outcomes $Y^{a,e}$ are monotonically increasing in both a and e. 
That is, if there are no individuals with response types ($Y^{a=1,e=1} = 0$, $Y^{a=0,e=1} = 1$), ($Y^{a=1,e=1} = 0$, $Y^{a=1,e=0} = 1$), 
($Y^{a=1,e=0} = 0$, $Y^{a=0,e=0} = 1$), and ($Y^{a=0,e=1} = 0$, $Y^{a=0,e=0} = 1$). 

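
As an illustration of monotonicity for two treatments, the sketch below lists the response types of Table 5.2 whose counterfactual outcomes are nondecreasing in both a and e. It is an assumption-laden illustration rather than part of the original text; the helper name `is_monotonic` is hypothetical.

```python
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]


def is_monotonic(y):
    """True if Y^{a,e} is nondecreasing in a (for each e) and in e (for each a)."""
    nondecreasing_in_a = all(y[(1, e)] >= y[(0, e)] for e in (0, 1))
    nondecreasing_in_e = all(y[(a, 1)] >= y[(a, 0)] for a in (0, 1))
    return nondecreasing_in_a and nondecreasing_in_e


monotonic_types = []
for number, outcomes in enumerate(product((1, 0), repeat=4), start=1):
    y = dict(zip(INTERVENTIONS, outcomes))
    if is_monotonic(y):
        monotonic_types.append(number)

print(monotonic_types)  # [1, 2, 4, 6, 8, 16]
```

Type 7 is absent from the list while type 8 is present, so under monotonic effects the only remaining type that dies under joint treatment but under neither single treatment is type 8.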


On the other hand, the absence of additive interaction between A and 
E implies that either no individual in the population belongs to one of the 
three classes described above, or that there is a perfect cancellation of equal 
deviations from additivity of opposite sign. Such cancellation would occur, for 
example, if there were an equal proportion of individuals of types 7 and 10, or 
of types 8 and 12. 

For more on cancellations that result in additivity even when interaction types are present, see Greenland, Lash, and Rothman (2008). 

The meaning of the term “interaction” is clarified by the classification of 
individuals according to their counterfactual response types (see also Fine Point 
5.1). We now introduce a tool to conceptualize the causal mechanisms involved 
in the interaction between two treatments. 
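
The cancellation discussed in this section can be checked numerically. The sketch below is illustrative only: the populations are hypothetical mixtures of response types, and `risk` and `additive_interaction` are names introduced here. It computes the additive interaction contrast from the proportion of each type.

```python
from fractions import Fraction
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]
TYPES = {
    number: dict(zip(INTERVENTIONS, outcomes))
    for number, outcomes in enumerate(product((1, 0), repeat=4), start=1)
}


def risk(proportions, a, e):
    """Pr[Y^{a,e} = 1] in a population given as {type number: proportion}."""
    return sum(p * TYPES[k][(a, e)] for k, p in proportions.items())


def additive_interaction(proportions):
    """Causal risk difference for A when e=1 minus the one when e=0."""
    effect_with_e = risk(proportions, 1, 1) - risk(proportions, 0, 1)
    effect_without_e = risk(proportions, 1, 0) - risk(proportions, 0, 0)
    return effect_with_e - effect_without_e


half = Fraction(1, 2)
balanced = {7: half, 10: half}  # equal proportions of types 7 and 10
unbalanced = {7: Fraction(3, 4), 10: Fraction(1, 4)}

print(additive_interaction(balanced))    # 0
print(additive_interaction(unbalanced))  # 1
```

In the balanced population every individual belongs to an interaction class, yet the population-level contrast is exactly zero.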


5.4 Sufficient causes 


The meaning of interaction is clarified by the classification of individuals ac- 
cording to their counterfactual response types. We now introduce a tool to 
represent the causal mechanisms involved in the interaction between two treat- 
ments. Consider again our heart transplant example with a single treatment 
A. As reviewed in the previous section, some individuals die when they are 
treated, others when they are not treated, others die no matter what, and 
others do not die no matter what. This variety of response types indicates 
that treatment A is not the only variable that determines whether or not the 
outcome Y occurs. 

Take those individuals who were actually treated. Only some of them died, 
which implies that treatment alone is insufficient to always bring about the 
outcome. As an oversimplified example, suppose that heart transplant A = 1 
only results in death in individuals allergic to anesthesia. We refer to the 
smallest set of background factors that, together with A = 1, are sufficient to 
inevitably produce the outcome as U1. The simultaneous presence of treatment 
(A = 1) and allergy to anesthesia (U1 = 1) is a minimal sufficient cause of the 
outcome Y. 

Now take those individuals who were not treated. Again only some of them 
died, which implies that lack of treatment alone is insufficient to bring about 
the outcome. As an oversimplified example, suppose that no heart transplant 
A = 0 only results in death if individuals have an ejection fraction less than 
20%. We refer to the smallest set of background factors that, together with 
A = 0, are sufficient to produce the outcome as U2. The simultaneous absence 
of treatment (A = 0) and presence of low ejection fraction (U2 = 1) is another 
sufficient cause of the outcome Y. 


Fine Point 5.1 

More on counterfactual types and interaction. The classification of individuals by counterfactual response types 
makes it easier to consider specific forms of interaction. For example, we may be interested in learning whether some 
individuals will develop the outcome when receiving both treatments E = 1 and A = 1, but not when receiving only one 
of the two. That is, whether individuals with counterfactual responses $Y^{a=1,e=1} = 1$ and $Y^{a=0,e=1} = Y^{a=1,e=0} = 0$ 
(types 7 and 8) exist in the population. VanderWeele and Robins (2007a, 2008) developed a theory of sufficient cause 
interaction for 2 and 3 treatments, and derived the identifying conditions for synergism that are described here. The 
following inequality is a sufficient condition for these individuals to exist: 

$\Pr[Y^{a=1,e=1} = 1] - (\Pr[Y^{a=0,e=1} = 1] + \Pr[Y^{a=1,e=0} = 1]) > 0$ 
or, equivalently, $\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1] > \Pr[Y^{a=1,e=0} = 1]$ 

That is, in an experiment in which treatments A and E are randomly assigned, one can compute the three counterfactual 
risks in the above inequality, and empirically check that individuals of types 7 and 8 exist. 

Because the above inequality is a sufficient but not a necessary condition, it may not hold even if types 7 and 8 
exist. In fact this sufficient condition is so strong that it may miss most cases in which these types exist. A weaker 
sufficient condition for synergism can be used if one knows, or is willing to assume, that receiving treatments A and E 
cannot prevent any individual from developing the outcome, i.e., if the effects are monotonic (see Technical Point 5.2). 
In this case, the inequality 

$\Pr[Y^{a=1,e=1} = 1] - \Pr[Y^{a=0,e=1} = 1] > \Pr[Y^{a=1,e=0} = 1] - \Pr[Y^{a=0,e=0} = 1]$ 

is a sufficient condition for the existence of types 7 and 8. In other words, when the effects of A and E are monotonic, 
the presence of superadditive interaction implies the presence of type 8 (monotonicity rules out type 7). This sufficient 
condition for synergism under monotonic effects was originally reported by Greenland and Rothman in a previous edition 
of their book. It is now reported in Greenland, Lash, and Rothman (2008). 

In genetic research it is sometimes interesting to determine whether there are individuals of type 8, a form of 
interaction referred to as compositional epistasis. VanderWeele (2010a) reviews empirical tests for compositional epistasis. 
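
The two sufficient conditions for synergism stated in Fine Point 5.1 can be sketched as simple checks on the counterfactual risks. The example is hypothetical: the population is an assumed mixture of response types from Table 5.2, and the function names are illustrative. It shows a case in which the stronger condition fails while the condition for monotonic effects holds.

```python
from fractions import Fraction
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]
TYPES = {
    number: dict(zip(INTERVENTIONS, outcomes))
    for number, outcomes in enumerate(product((1, 0), repeat=4), start=1)
}


def risks(proportions):
    """Counterfactual risks Pr[Y^{a,e} = 1] for each joint intervention (a, e)."""
    return {
        ae: sum(p * TYPES[k][ae] for k, p in proportions.items())
        for ae in INTERVENTIONS
    }


def strong_condition(r):
    """Pr[Y^{1,1}=1] - (Pr[Y^{0,1}=1] + Pr[Y^{1,0}=1]) > 0."""
    return r[(1, 1)] - (r[(0, 1)] + r[(1, 0)]) > 0


def monotone_condition(r):
    """Superadditivity; informative only if monotonic effects are assumed."""
    return r[(1, 1)] - r[(0, 1)] > r[(1, 0)] - r[(0, 0)]


# Hypothetical population: 40% type 1, 20% type 8, 40% type 16.
# All three types are compatible with monotonic effects.
population = {1: Fraction(2, 5), 8: Fraction(1, 5), 16: Fraction(2, 5)}
r = risks(population)
print(strong_condition(r))    # False
print(monotone_condition(r))  # True
```

Type 8 individuals exist in this population, but the "doomed" individuals of type 1 inflate the single-treatment risks enough that the stronger condition misses them.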

Finally, suppose there are some individuals who have neither U1 nor U2 
and that would have developed the outcome whether they had been treated or 
untreated. The existence of these “doomed” individuals implies that there are 
some other background factors that are themselves sufficient to bring about 
the outcome. As an oversimplified example, suppose that all individuals with 
pancreatic cancer at the start of the study will die. We refer to the smallest set 
of background factors that are sufficient to produce the outcome regardless of 
treatment status as U0. The presence of pancreatic cancer (U0 = 1) is another 
sufficient cause of the outcome Y. 

By definition of background factors, the dichotomous variables U cannot be intervened on, and cannot be affected by treatment A. 

We described 3 sufficient causes for the outcome: treatment A = 1 in 
the presence of U1, no treatment A = 0 in the presence of U2, and presence 
of U0 regardless of treatment status. Each sufficient cause has one or more 
components, e.g., A = 1 and U1 = 1 in the first sufficient cause. Figure 5.1 
represents each sufficient cause by a circle and its components as sections of 
the circle. The term sufficient-component causes is often used to refer to the 
sufficient causes and their components. 

Figure 5.1 

Greenland and Poole (1988) first enumerated these 9 sufficient causes. 



The graphical representation of sufficient-component causes helps visualize 
a key consequence of effect modification: as discussed in Chapter 4, the 
magnitude of the causal effect of treatment A depends on the distribution of effect 
modifiers. Imagine two hypothetical scenarios. In the first one, the population 
includes only 1% of individuals with U1 = 1 (i.e., allergy to anesthesia). In 
the second one, the population includes 10% of individuals with U1 = 1. The 
distribution of U2 and U0 is identical between these two populations. Now, 
separately in each population, we conduct a randomized experiment of heart 
transplant A in which half of the population is assigned to treatment A = 1. 
The average causal effect of heart transplant A on death will be greater in the 
second population because there are more individuals susceptible to develop 
the outcome if treated. One of the 3 sufficient causes, A = 1 plus U1 = 1, is 
10 times more common in the second population than in the first one, whereas 
the other two sufficient causes are equally frequent in both populations. 
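
The comparison of the two populations can be made concrete with a small calculation. This is a sketch under assumptions that are not in the text: the background factors are taken to be mutually independent, and the prevalences of U0 and U2 are made-up values shared by both populations.

```python
def average_causal_effect(p_u0, p_u1, p_u2):
    """Causal risk difference Pr[Y^{a=1} = 1] - Pr[Y^{a=0} = 1].

    Assumes, for illustration only, that U0, U1, and U2 are mutually independent.
    """
    risk_treated = 1 - (1 - p_u0) * (1 - p_u1)    # U0 = 1 or U1 = 1
    risk_untreated = 1 - (1 - p_u0) * (1 - p_u2)  # U0 = 1 or U2 = 1
    return risk_treated - risk_untreated


# Hypothetical prevalences of U0 and U2, identical in both populations.
P_U0, P_U2 = 0.05, 0.005

effect_pop1 = average_causal_effect(P_U0, 0.01, P_U2)  # 1% with U1 = 1
effect_pop2 = average_causal_effect(P_U0, 0.10, P_U2)  # 10% with U1 = 1

print(effect_pop1 < effect_pop2)  # True
```

Only the prevalence of U1 differs between the two calls, so the difference in the average causal effect is driven entirely by the distribution of that background factor.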

The graphical representation of sufficient-component causes also helps 
visualize an alternative concept of interaction, which is described in the next 
section. First we need to describe the sufficient causes for two treatments A 
and E. Consider our vitamins and heart transplant example. We have 
already described 3 sufficient causes of death: presence/absence of A (or E) is 
irrelevant, presence of transplant A regardless of vitamins E, and absence of 
transplant A regardless of vitamins E. In the case of two treatments we need 
to add 2 more ways to die: presence of vitamins E regardless of transplant A, 
and absence of vitamins regardless of transplant A. We also need to add four 
more sufficient causes to accommodate those who would die only under certain 
combinations of values of the treatments A and E. Thus, depending on which 
background factors are present, there are 9 possible ways to die: 


1. by treatment A (treatment E is irrelevant) 
2. by the absence of treatment A (treatment E is irrelevant) 
3. by treatment E (treatment A is irrelevant) 
4. by the absence of treatment E (treatment A is irrelevant) 
5. by both treatments A and E 
6. by treatment A and the absence of E 
7. by treatment E and the absence of A 
8. by the absence of both A and E 
9. by other mechanisms (both treatments A and E are irrelevant) 


In other words, there are 9 possible sufficient causes with treatment 
components A = 1 only, A = 0 only, E = 1 only, E = 0 only, A = 1 and E = 1, 
A = 1 and E = 0, A = 0 and E = 1, A = 0 and E = 0, and neither A nor 
E matter. Each of these sufficient causes includes a set of background factors 
from U1,..., U8 and U0. Figure 5.2 represents the 9 sufficient-component causes 
for two treatments A and E. 

Figure 5.2 
Panel labels: (A = 1, U1 = 1); (A = 0, U2 = 1); (E = 1, U3 = 1); (E = 0, U4 = 1); (A = 1, E = 1, U5 = 1); (A = 1, E = 0, U6 = 1); (A = 0, E = 1, U7 = 1); (A = 0, E = 0, U8 = 1) 


Not all 9 sufficient-component causes for a dichotomous outcome and two 
treatments exist in all settings. For example, if receiving vitamins E = 1 does 
not kill any individual, regardless of her treatment A, then the 3 sufficient 
causes with the component E = 1 will not be present. The existence of those 
3 sufficient causes would mean that some individuals (e.g., those with U5 = 1) 
would be killed by receiving vitamins (E = 1), that is, their death would be 
prevented by not giving vitamins (E = 0) to them. 

This graphical representation of sufficient-component causes is often referred to as “the causal pies.” 
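
The nine sufficient causes can be turned into a rule that maps an individual's background factors to a response type. The following sketch is illustrative: it assumes the pairing of background factors U0 to U8 with treatment components shown in Figure 5.2 and the row order of Table 5.2, and the function names are hypothetical.

```python
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]

# Treatment components (a, e) of each sufficient cause, keyed by its background
# factor. None means that the treatment is irrelevant for that sufficient cause.
SUFFICIENT_CAUSES = {
    "U0": (None, None),
    "U1": (1, None), "U2": (0, None),
    "U3": (None, 1), "U4": (None, 0),
    "U5": (1, 1), "U6": (1, 0), "U7": (0, 1), "U8": (0, 0),
}


def outcome(background, a, e):
    """Y^{a,e}: 1 if at least one sufficient cause is completed under (a, e)."""
    for factor in background:
        need_a, need_e = SUFFICIENT_CAUSES[factor]
        if need_a in (None, a) and need_e in (None, e):
            return 1
    return 0


def response_type(background):
    """Row number in Table 5.2 for an individual with these background factors."""
    outcomes = tuple(outcome(background, a, e) for a, e in INTERVENTIONS)
    return list(product((1, 0), repeat=4)).index(outcomes) + 1


print(response_type({"U5"}))        # 8
print(response_type({"U1", "U3"}))  # 2
print(response_type(set()))         # 16
```

An individual whose only background factor is U5 dies only under joint treatment, which is response type 8 in Table 5.2.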


5.5 Sufficient cause interaction 


The colloquial use of the term “interaction between treatments A and E” 
evokes the existence of some causal mechanism by which the two treatments 
work together (i.e., “interact”) to produce a certain outcome. Interestingly, the 
definition of interaction within the counterfactual framework does not require 
any knowledge about those mechanisms nor even that the treatments work 
together (see Fine Point 5.3). In our example of vitamins E and heart 
transplant A, we said that there is an interaction between the treatments A and 
E if the causal effect of A when everybody receives E is different from the 
causal effect of A when nobody receives E. That is, interaction is defined 
by the contrast of counterfactual quantities, and can therefore be identified 
by conducting an ideal randomized experiment in which the conditions of ex- 
changeability, positivity, and consistency hold for both treatments A and E. 
There is no need to contemplate the causal mechanisms (physical, chemical, 
biologic, sociological...) that underlie the presence of interaction. 

This section describes a second concept of interaction that perhaps brings 
us one step closer to the causal mechanisms by which treatments A and E 
bring about the outcome. This second concept of interaction is not based on 
counterfactual contrasts but rather on sufficient-component causes, and thus 
we refer to it as interaction within the sufficient-component-cause framework 
or, for brevity, sufficient cause interaction. 

A sufficient cause interaction between A and E exists in the population if 
A and E occur together in a sufficient cause. For example, suppose individuals 
with background factors U5 = 1 will develop the outcome when jointly receiving 
vitamins (E = 1) and heart transplant (A = 1), but not when receiving only 
one of the two treatments. Then a sufficient cause interaction between A and 
E exists if there exists an individual with U5 = 1. It then follows that if 
there exists an individual with counterfactual responses $Y^{a=1,e=1} = 1$ and 
$Y^{a=0,e=1} = Y^{a=1,e=0} = 0$, a sufficient cause interaction between A and E is 
present. 


Fine Point 5.2 

From counterfactuals to sufficient-component causes, and vice versa. There is a correspondence between the 
counterfactual response types and the sufficient component causes. In the case of a dichotomous treatment and outcome, 
suppose an individual has none of the background factors U0, U1, U2. She will have an “immune” response type because 
she lacks the components necessary to complete all of the sufficient causes, whether she is treated or not. The table 
below displays the mapping between response types and sufficient-component causes in the case of one treatment A. 

Type      $Y^{a=0}$   $Y^{a=1}$   Component causes 
Doomed       1           1        U0 = 1 or {U1 = 1 and U2 = 1} 
Helped       1           0        U0 = 0 and U1 = 0 and U2 = 1 
Hurt         0           1        U0 = 0 and U1 = 1 and U2 = 0 
Immune       0           0        U0 = 0 and U1 = 0 and U2 = 0 

A particular combination of component causes corresponds to one and only one counterfactual type. However, a 
particular response type may correspond to several combinations of component causes. For example, individuals of the 
“doomed” type may have any combination of component causes including U0 = 1, no matter what the values of U1 
and U2 are, or any combination including {U1 = 1 and U2 = 1}. 

Sufficient-component causes can also be used to provide a mechanistic description of exchangeability $Y^a \perp\!\!\!\perp A$. For 
a dichotomous treatment and outcome, exchangeability means that the proportion of individuals who would have the 
outcome under treatment, and under no treatment, is the same in the treated A = 1 and the untreated A = 0. That 
is, $\Pr[Y^{a=1} = 1 | A = 1] = \Pr[Y^{a=1} = 1 | A = 0]$ and $\Pr[Y^{a=0} = 1 | A = 1] = \Pr[Y^{a=0} = 1 | A = 0]$. 

Now the individuals who would develop the outcome if treated are the “doomed” and the “hurt”, that is, those with 
U0 = 1 or U1 = 1. The individuals who would get the outcome if untreated are the “doomed” and the “helped”, that is, 
those with U0 = 1 or U2 = 1. Therefore there will be exchangeability if the proportions of “doomed” + “hurt” and of 
“doomed” + “helped” are equal in the treated and the untreated. That is, exchangeability for a dichotomous treatment 
and outcome can be expressed in terms of sufficient-component causes as Pr[U0 = 1 or U1 = 1|A = 1] = Pr[U0 = 1 or 
U1 = 1|A = 0] and Pr[U0 = 1 or U2 = 1|A = 1] = Pr[U0 = 1 or U2 = 1|A = 0]. 

For additional details see Greenland and Brumback (2002), Flanders (2006), and VanderWeele and Hernán (2006). 
Some of the above results were generalized to the case of two or more dichotomous treatments by VanderWeele and 
Robins (2008). 
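
The correspondence between component causes and response types for a single treatment A, as described in Fine Point 5.2, can be sketched in a few lines. The function name `response_type` is an illustrative choice; the rule it encodes is that the outcome occurs under a = 1 if U0 = 1 or U1 = 1, and under a = 0 if U0 = 1 or U2 = 1.

```python
from itertools import product


def response_type(u0, u1, u2):
    """Map the background factors (U0, U1, U2) to a response type for treatment A."""
    y_treated = int(u0 or u1)    # Y^{a=1}: U0 alone, or A = 1 together with U1
    y_untreated = int(u0 or u2)  # Y^{a=0}: U0 alone, or A = 0 together with U2
    names = {(1, 1): "doomed", (1, 0): "helped", (0, 1): "hurt", (0, 0): "immune"}
    return names[(y_untreated, y_treated)]


# Every combination of component causes maps to exactly one response type.
mapping = {u: response_type(*u) for u in product((0, 1), repeat=3)}
for (u0, u1, u2), name in mapping.items():
    print(f"U0={u0} U1={u1} U2={u2} -> {name}")
```

Five of the eight combinations map to the "doomed" type, which illustrates that one response type may correspond to several combinations of component causes.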


Sufficient cause interactions can be synergistic or antagonistic. There is 
synergism between treatment A and treatment E when A = 1 and E = 1 
are present in the same sufficient cause, and antagonism between treatment 
A and treatment E when A = 1 and E = 0 (or A = 0 and E = 1) are 
present in the same sufficient cause. Alternatively, one can think of antagonism 
between treatment A and treatment E as synergism between treatment A and 
no treatment E (or between no treatment A and treatment E). 

Unlike the counterfactual definition of interaction, sufficient cause 
interaction makes explicit reference to the causal mechanisms involving the 
treatments A and E. One could then think that identifying the presence of sufficient 
cause interaction requires detailed knowledge about these causal mechanisms. 
It turns out that this is not always the case: sometimes we can conclude that 
sufficient cause interaction exists even if we lack any knowledge whatsoever 
about the sufficient causes and their components. Specifically, if the 
inequalities in Fine Point 5.1 hold, then there exists synergism between A and E. 
That is, one can empirically check that synergism is present without ever 
giving any thought to the causal mechanisms by which A and E work together 
to bring about the outcome. This result is not that surprising because of the 
correspondence between counterfactual response types and sufficient causes 
(see Fine Point 5.2), and because the above inequality is a sufficient but not a 
necessary condition, i.e., the inequality may not hold even if synergism exists. 

Rothman (1976) described the concepts of synergism and antagonism within the sufficient-component-cause framework. 


Fine Point 5.3 

Biologic interaction. In epidemiologic discussions, sufficient cause interaction is commonly referred to as biologic 
interaction (Rothman et al., 1980). This choice of terminology might seem to imply that, in biomedical applications, 
there exist biological mechanisms through which two treatments A and E act on each other in bringing about the 
outcome. However, this may not be necessarily the case as illustrated by the following example proposed by VanderWeele 
and Robins (2007a). 

Suppose A and E are the two alleles of a gene that produces an essential protein. Individuals with a deleterious 
mutation in both alleles (A = 1 and E = 1) will lack the essential protein and die within a week after birth, whereas 
those with a mutation in none of the alleles (i.e., A = 0 and E = 0) or in only one of the alleles (i.e., A = 0 and E = 1, 
A = 1 and E = 0) will have normal levels of the protein and will survive. We would say that there is synergism between 
the alleles A and E because there exists a sufficient component cause of death that includes A = 1 and E = 1. That 
is, both alleles work together to produce the outcome. However, it might be argued that they do not physically act on 
each other and thus that they do not interact in any biological sense. 


5.6 Counterfactuals or sufficient-component causes? 


A counterfactual framework of causation was already hinted at by Hume (1748). 

The sufficient-component-cause framework was developed in philosophy by Mackie (1965). He introduced the concept of INUS condition for Y: an Insufficient but Necessary part of a condition which is itself Unnecessary but exclusively Sufficient for Y. 


The sufficient-component-cause framework and the counterfactual (potential 
outcomes) framework address different questions. The sufficient component 
cause model considers sets of actions, events, or states of nature which together 
inevitably bring about the outcome under consideration. The model gives an 
account of the causes of a particular effect. It addresses the question, “Given a 
particular effect, what are the various events which might have been its cause?” 
The potential outcomes or counterfactual model focuses on one particular cause 
or intervention and gives an account of the various effects of that cause. In 
contrast to the sufficient component cause framework, the potential outcomes 
framework addresses the question, “What would have occurred if a particular 
factor were intervened upon and thus set to a different level than it in fact 
was?” Unlike the sufficient component cause framework, the counterfactual 
framework does not require a detailed knowledge of the mechanisms by which 
the factor affects the outcome. 

The counterfactual approach addresses the question “what happens?” The 
sufficient-component-cause approach addresses the question “how does it 
happen?” For the contents of this book (conditions and methods to estimate the 
average causal effects of hypothetical interventions), the counterfactual 
framework is the natural one. The sufficient-component-cause framework is helpful 
to think about the causal mechanisms at work in bringing about a particular 
outcome. Sufficient-component causes have a rightful place in the teaching of 
causal inference because they help understand key concepts like the dependence 
of the magnitude of causal effects on the distribution of background factors 
(effect modifiers), and the relationship between effect modification, interaction, 
and synergism. 


Fine Point 5.4 

More on the attributable fraction. Fine Point 3.4 defined the excess fraction for treatment A as the proportion of 
cases attributable to treatment A in a particular population, and described an example in which the excess fraction for 
A was 75%. That is, 75% of the cases would not have occurred if everybody had received treatment a = 0 rather than 
their observed treatment A. Now consider a second treatment E. Suppose that the excess fraction for E is 50%. Does 
this mean that a joint intervention on A and E could prevent 125% (75% + 50%) of the cases? Of course not. 

Clearly the excess fraction cannot exceed 100% for a single treatment (either A or E). Similarly, it should be 
clear that the excess fraction for any joint intervention on A and E cannot exceed 100%. That is, if we were allowed 
to intervene in any way we wish (by modifying A, E, or both) in a population, we could never prevent a fraction of 
disease greater than 100%. In other words, no more than 100% of the cases can be attributed to the lack of certain 
intervention, whether single or joint. But then why is the sum of excess fractions for two single treatments greater than 
100%? The sufficient-component-cause framework helps answer this question. 

As an example, suppose that Zeus had background factors U5 = 1 (and none of the other background factors) and 
was treated with both A = 1 and E = 1. Zeus would not have been a case if either treatment A or treatment E had 
been withheld. Thus Zeus is counted as a case prevented by an intervention that sets a = 0, i.e., Zeus is part of the 
75% of cases attributable to A. But Zeus is also counted as a case prevented by an intervention that sets e = 0, i.e., 
Zeus is part of the 50% of cases attributable to E. No wonder the sum of the excess fractions for A and E exceeds 
100%: some individuals like Zeus are counted twice! 

The sufficient-component-cause framework shows that it makes little sense to talk about the fraction of disease 
attributable to A and E separately when both may be components of the same sufficient cause. For example, the 
discussion about the fraction of disease attributable to either genes or environment is misleading. Consider the mental 
retardation caused by phenylketonuria, a condition that appears in genetically susceptible individuals who eat certain 
foods. The excess fraction for those foods is 100% because all cases can be prevented by removing the foods from 
the diet. The excess fraction for the genes is also 100% because all cases would be prevented if we could replace the 
susceptibility genes. Thus the causes of mental retardation can be seen as either 100% genetic or 100% environmental. 
See Rothman, Greenland, and Lash (2008) for further discussion. 

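
The double counting behind excess fractions that sum to more than 100% (Fine Point 5.4) can be reproduced with a toy population. Everything in this sketch is hypothetical: four treated individuals, two of whom are like Zeus (only U5 = 1), one with only U1 = 1, and one with only U0 = 1. The numbers were chosen so that the excess fractions match the 75% and 50% quoted in Fine Point 5.4.

```python
from fractions import Fraction

# Treatment components (a, e) of the sufficient causes used here; None = irrelevant.
SUFFICIENT_CAUSES = {"U0": (None, None), "U1": (1, None), "U5": (1, 1)}

# Hypothetical population: (background factors, observed A, observed E).
POPULATION = [
    ({"U5"}, 1, 1),  # like Zeus: needs both A = 1 and E = 1
    ({"U5"}, 1, 1),
    ({"U1"}, 1, 1),  # killed by A = 1 regardless of E
    ({"U0"}, 1, 1),  # doomed regardless of treatment
]


def outcome(background, a, e):
    """1 if at least one sufficient cause is completed under (a, e)."""
    for factor in background:
        need_a, need_e = SUFFICIENT_CAUSES[factor]
        if need_a in (None, a) and need_e in (None, e):
            return 1
    return 0


def cases(set_a=None, set_e=None):
    """Number of cases when A and/or E are set for everyone (None keeps observed)."""
    total = 0
    for background, a_obs, e_obs in POPULATION:
        a = a_obs if set_a is None else set_a
        e = e_obs if set_e is None else set_e
        total += outcome(background, a, e)
    return total


def excess_fraction(set_a=None, set_e=None):
    """Proportion of observed cases prevented by the intervention."""
    observed = cases()
    return Fraction(observed - cases(set_a, set_e), observed)


print(excess_fraction(set_a=0))           # 3/4
print(excess_fraction(set_e=0))           # 1/2
print(excess_fraction(set_a=0, set_e=0))  # 3/4
```

The two single-treatment fractions add up to 125%, yet the joint intervention prevents only 75% of the cases, because the two Zeus-like individuals are counted under both A and E.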
Though the sufficient-component-cause framework is useful from a 
pedagogic standpoint, its relevance to actual data analysis is yet to be determined. 
In its classical form, the sufficient-component-cause framework is 
deterministic, its conclusions depend on the coding of the outcome, and it is by definition 
limited to dichotomous treatments and outcomes (or to variables that can be 
recoded as dichotomous variables). This limitation practically rules out the 
consideration of any continuous factors, and restricts the applicability of the 
framework to contexts with a small number of dichotomous factors. More 
recent extensions of the sufficient-component-cause framework to stochastic 
settings and to categorical and ordinal treatments might lead to an increased 
application of this approach to realistic data analysis. Finally, even allowing for 
these extensions of the sufficient-component-cause framework, we may rarely 
have the large amount of data needed to study the fine distinctions it makes. 
To estimate causal effects more generally, the counterfactual framework will 
likely continue to be the one most often employed. Some apparently alternative 
frameworks (causal diagrams, decision theory) are essentially equivalent to 
the counterfactual framework, as described in the next chapter. 

VanderWeele (2010b) provided extensions to 3-level treatments. VanderWeele and Robins (2012) explored the relationship between stochastic counterfactuals and stochastic sufficient causes. 


Technical Point 5.3 

Monotonicity of causal effects and sufficient causes. When treatments A and E have monotonic effects, then some 
sufficient causes are guaranteed not to exist. For example, suppose that cigarette smoking (A = 1) never prevents heart 
disease, and that physical inactivity (E = 1) never prevents heart disease. Then no sufficient causes including either 
A = 0 or E = 0 can be present. This is so because, if a sufficient cause including the component A = 0 existed, then 
some individuals (e.g., those with U2 = 1) would develop the outcome if they were unexposed (A = 0) or, equivalently, 
the outcome could be prevented in those individuals by treating them (A = 1). The same rationale applies to E = 0. 
The sufficient component causes that cannot exist when the effects of A and E are monotonic are crossed out in Figure 
5.3. 
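
The claim in Technical Point 5.3 can be explored by enumeration. The sketch below is an illustration, not part of the original text: it assumes the pairing of background factors with treatment components used in Figure 5.2, keeps only the sufficient causes with no A = 0 or E = 0 component, and lists the response types (rows of Table 5.2) that can then arise.

```python
from itertools import product

# Joint interventions (a, e) in the column order of Table 5.2.
INTERVENTIONS = [(1, 1), (0, 1), (1, 0), (0, 0)]

# Sufficient causes that can remain under monotonic effects of A and E:
# none has A = 0 or E = 0 as a component (None = treatment irrelevant).
REMAINING_CAUSES = {
    "U0": (None, None),
    "U1": (1, None),
    "U3": (None, 1),
    "U5": (1, 1),
}


def outcome(present, a, e):
    """1 if at least one remaining sufficient cause is completed under (a, e)."""
    for factor in present:
        need_a, need_e = REMAINING_CAUSES[factor]
        if need_a in (None, a) and need_e in (None, e):
            return 1
    return 0


all_rows = list(product((1, 0), repeat=4))  # rows of Table 5.2, types 1..16
reachable_types = set()
for flags in product((0, 1), repeat=len(REMAINING_CAUSES)):
    present = {f for f, flag in zip(REMAINING_CAUSES, flags) if flag}
    row = tuple(outcome(present, a, e) for a, e in INTERVENTIONS)
    reachable_types.add(all_rows.index(row) + 1)

print(sorted(reachable_types))  # [1, 2, 4, 6, 8, 16]
```

Every reachable type has counterfactual outcomes that are nondecreasing in both a and e, which is the individual-level meaning of monotonic effects.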





Figure 5.3 

Chapter 6 
GRAPHICAL REPRESENTATION OF CAUSAL EFFECTS 


Causal inference generally requires expert knowledge and untestable assumptions about the causal network linking 
treatment, outcome, and other variables. Earlier chapters focused on the conditions and methods to compute 
causal effects in oversimplified scenarios (e.g., the causal effect of your looking up on other pedestrians’ behavior, 
an idealized heart transplant study). The goal was to provide a gentle introduction to the ideas underlying the 
more sophisticated approaches that are required in realistic settings. Because the scenarios we considered were so 
simple, there was really no need to make the causal network explicit. As we start to turn our attention towards 
more complex situations, however, it will become crucial to be explicit about what we know and what we assume 
about the variables relevant to our particular causal inference problem. 

This chapter introduces a graphical tool to represent our qualitative expert knowledge and a priori assumptions 
about the causal structure of interest. By summarizing knowledge and assumptions in an intuitive way, graphs 
help clarify conceptual problems and enhance communication among investigators. The use of graphs in causal 
inference problems makes it easier to follow sensible advice: draw your assumptions before your conclusions. 


6.1 Causal diagrams 


This chapter describes graphs, which we will refer to as causal diagrams, to 
represent key causal concepts. The modern theory of diagrams for causal 
inference arose within the disciplines of computer science and artificial intelligence. 
This and the next three chapters are focused on problem conceptualization via 
causal diagrams. 

Comprehensive books on this subject have been written by Pearl (2009) and Spirtes, Glymour and Scheines (2000). 

Take a look at the graph in Figure 6.1. It comprises three nodes representing 
random variables (L, A, Y) and three edges (the arrows). We adopt the 
convention that time flows from left to right, and thus L is temporally prior to 
A and Y, and A is temporally prior to Y. As in previous chapters, L, A, and 
Y represent disease severity, heart transplant, and death, respectively. 

The presence of an arrow pointing from a particular variable V to another 
variable W indicates that we know there is a direct causal effect (i.e., an 
effect not mediated through any other variables on the graph) for at least one 
individual. Alternatively, the lack of an arrow means that we know that V has 
no direct causal effect on W for any individual in the population. For example, 
in Figure 6.1, the arrow from L to A means that disease severity affects the 
probability of receiving a heart transplant. A standard causal diagram does 
not distinguish whether an arrow represents a harmful effect or a protective 
effect. Furthermore, if, as in Figure 6.1, a variable (here, Y) has two causes, 
the diagram does not encode how the two causes interact. 

Figure 6.1 (nodes L, A, Y; arrows L → A, A → Y, and L → Y) 

Causal diagrams like the one in Figure 6.1 are known as directed acyclic 
graphs, which is commonly abbreviated as DAGs. “Directed” because the edges 
imply a direction: because the arrow from L to A is into A, L may cause A, but 
not the other way around. “Acyclic” because there are no cycles: a variable 
cannot cause itself, either directly or through another variable. 
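
The two defining words can be made concrete with a minimal sketch in Python (a language this book does not otherwise use). We store the graph of Figure 6.1 by listing, for each variable, the variables with an arrow into it, and then check that no variable can be reached from itself by following arrows. The names `parents`, `reachable_from`, and `is_acyclic` are illustrative choices of ours, not standard terminology.

```python
# Figure 6.1 stored as "variable -> variables with an arrow into it".
parents = {"L": [], "A": ["L"], "Y": ["L", "A"]}


def reachable_from(node, parents):
    """Return every variable that can be reached from `node` by following arrows."""
    # Invert the parent lists to obtain, for each variable, the variables it points to.
    children = {v: [w for w, pa in parents.items() if v in pa] for v in parents}
    found, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def is_acyclic(parents):
    """A directed graph is acyclic when no variable can be reached from itself."""
    return all(v not in reachable_from(v, parents) for v in parents)


print(sorted(reachable_from("L", parents)))  # variables that L may cause
print(is_acyclic(parents))
```

A graph in which A points to Y and Y points back to A, written `{"A": ["Y"], "Y": ["A"]}`, would fail the `is_acyclic` check and therefore could not be a DAG.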

Technical Point 6.1 


Causal directed acyclic graphs. We define a directed acyclic graph (DAG) G to be a graph whose nodes (vertices) 
are random variables V = (V_1, ..., V_M) with directed edges (arrows) and no directed cycles. We use PA_m to denote 
the parents of V_m, i.e., the set of nodes from which there is a direct arrow into V_m. The variable V_m is a descendant 
of V_j (and V_j is an ancestor of V_m) if there is a sequence of nodes connected by edges between V_j and V_m such that, 
following the direction indicated by the arrows, one can reach V_m by starting at V_j. For example, consider the DAG in 
Figure 6.1. In this DAG, M = 3 and we can choose V_1 = L, V_2 = A, and V_3 = Y; the parents PA_3 of V_3 = Y are 
(L, A). We will adopt the ordering convention that if m > j, V_m is not an ancestor of V_j. We define the distribution 
of V to be Markov with respect to a DAG G (equivalently, the distribution factors according to a DAG G) if, for each 
j, V_j is independent of its non-descendants conditional on its parents. 

A causal DAG is a DAG in which 1) the lack of an arrow from node V_j to V_m (i.e., V_j is not a parent of V_m) can 
be interpreted as the absence of a direct causal effect of V_j on V_m relative to the other variables on the graph, 2) all 
common causes, even if unmeasured, of any pair of variables on the graph are themselves on the graph, and 3) any 
variable is a cause of its descendants. 

Causal DAGs are of no practical use unless we make an assumption linking the causal structure represented by 
the DAG to the data obtained in a study. This assumption, referred to as the causal Markov assumption, states that, 
conditional on its direct causes, a variable V_j is independent of any variable for which it is not a cause. That is, 
conditional on its parents, V_j is independent of its non-descendants. This latter statement is mathematically equivalent 
to the statement that the density f(V) of the variables V in DAG G satisfies the Markov factorization 

    f(v) = \prod_{j=1}^{M} f(v_j \mid pa_j). 


Directed acyclic graphs have applications other than causal inference. Here 
we focus on causal directed acyclic graphs. A defining property of causal DAGs 
is that, conditional on its direct causes, any variable on the DAG is independent 
of any other variable for which it is not a cause. This assumption, referred to 
as the causal Markov assumption, implies that in a causal DAG the common 
causes of any pair of variables in the graph must be also in the graph. For a 
formal definition of causal DAGs, see Technical Point 6.1. 

For example, suppose in our study individuals are randomly assigned to 
heart transplant A with a probability that depends on the severity of their 
disease L. Then L is a common cause of A and Y, and needs to be included 
in the graph, as shown in the causal diagram in Figure 6.1. Now suppose 
in our study all individuals are randomly assigned to heart transplant with 
the same probability regardless of their disease severity. Then L is not a 
common cause of A and Y and need not be included in the causal diagram. 
Figure 6.1 represents a conditionally randomized experiment, whereas Figure 
6.2 represents a marginally randomized experiment. 

Figure 6.1 may also represent an observational study. Specifically, Figure 
6.1 represents an observational study in which we are willing to assume that 
the assignment of heart transplant A has disease severity L as a parent and no 
other causes of Y as parents. Otherwise, those causes of Y, even if unmeasured, would 
need to be included in the diagram, as they would be common causes of A and 
Y. In the next chapter we will describe how the willingness to consider Figure 
6.1 as the causal diagram for an observational study is the graphic translation 
of the assumption of conditional exchangeability given L, Y^a ⊥⊥ A | L for all a. 

Many people find the graphical approach to causal inference easier to use 
and more intuitive than the counterfactual approach. However, the two approaches 
are intimately linked. Specifically, associated with each graph is an 
underlying counterfactual model (see Technical Point 6.2). It is this model 
that provides the mathematical justification for the heuristic, intuitive graphical 
methods we now describe. However, conventional causal diagrams do not 
include the underlying counterfactual variables on the graph. Therefore the 
link between graphs and counterfactuals has traditionally remained hidden. 
A recently developed type of causal directed acyclic graph, the Single World 
Intervention Graph (SWIG), seamlessly unifies the counterfactual and graphical 
approaches to causal inference by explicitly including the counterfactual 
variables on the graph. We defer the introduction of SWIGs until Chapter 7 
as the material covered in this chapter serves as a necessary prerequisite. 

Richardson and Robins (2013) developed the Single World Intervention Graph (SWIG). 

[Figure 6.2: causal diagram with a single arrow A → Y] 

Causal diagrams are a simple way to encode our subject-matter knowledge, 
and our assumptions, about the qualitative causal structure of a problem. But, 
as described in the next sections, causal diagrams also encode information 
about potential associations between the variables in the causal network. It 
is precisely this simultaneous representation of association and causation that 
makes causal diagrams such an attractive tool. What follows is an informal 
introduction to graphic rules to infer associations from causal diagrams. Our 
emphasis is on conceptual insight rather than on formal rigor. 


6.2 Causal diagrams and marginal independence 

[Figure 6.3: causal diagram with arrows L → A and L → Y] 


A path between two variables R and S in a DAG is a route that connects 
R and S by following a sequence of edges such that the route visits no 
variable more than once. A path is causal if it consists entirely of edges 
with their arrows pointing in the same direction. Otherwise it is noncausal. 


Consider the following two examples. First, suppose you know that aspirin use 
A has a preventive causal effect on the risk of heart disease Y, i.e., 
Pr[Y^{a=1} = 1] ≠ Pr[Y^{a=0} = 1]. The causal diagram in Figure 6.2 is the graphical 
translation of this knowledge for an experiment in which aspirin A is randomly, and 
unconditionally, assigned. Second, suppose you know that carrying a lighter A 
has no causal effect (causative or preventive) on anyone’s risk of lung cancer Y, 
i.e., Pr[Y^{a=1} = 1] = Pr[Y^{a=0} = 1], and that cigarette smoking L has a causal 
effect on both carrying a lighter A and lung cancer Y. The causal diagram in 
Figure 6.3 is the graphical translation of this knowledge. The lack of an arrow 
between A and Y indicates that carrying a lighter does not have a causal effect 
on lung cancer; L is depicted as a common cause of A and Y. 

To draw Figures 6.2 and 6.3 we only used your knowledge about the causal 
relations among the variables in the diagram but, interestingly, these causal 
diagrams also encode information about the expected associations (or, more 
exactly, the lack of them) among the variables in the diagram. We now argue 
heuristically that, in general, the variables A and Y will be associated in both 
Figure 6.2 and 6.3, and describe key related results from causal graphs theory. 

Take first the randomized experiment represented in Figure 6.2. Intuitively 
one would expect that two variables A and Y linked only by a causal arrow 
would be associated. And that is exactly what causal graphs theory shows: 
when one knows that A has a causal effect on Y, as in Figure 6.2, then one 
should also generally expect A and Y to be associated. This is of course 
consistent with the fact that, in an ideal randomized experiment with unconditional 
exchangeability, causation Pr[Y^{a=1} = 1] ≠ Pr[Y^{a=0} = 1] implies 
association Pr[Y = 1|A = 1] ≠ Pr[Y = 1|A = 0], and vice versa. A heuristic 
that captures the causation-association correspondence in causal diagrams is 
the visualization of the paths between two variables as pipes or wires through 
which association flows. Association, unlike causation, is a symmetric relationship 
between two variables; thus, when present, association flows between two 
variables regardless of the direction of the causal arrows. In Figure 6.2 one 
could equivalently say that the association flows from A to Y or from Y to A. 
Technical Point 6.2 


Counterfactual models associated with a causal DAG. In this book, a causal DAG G represents an underlying 
counterfactual model. To provide a formal definition of the counterfactual model represented by a DAG G, we use the 
following notation. For any random variable W, let \mathcal{W} denote the support (i.e., the set of possible values w) of W. For 
any set of ordered variables W_1, ..., W_m, define \bar{w}_m = (w_1, ..., w_m). Let R denote any subset of variables in V and 
let r be a value of R. Then V_m^r denotes the counterfactual value of V_m when R is set to r. 

A nonparametric structural equation model (NPSEM) represented by a DAG G with vertex set V assumes the 
existence of unobserved random variables (errors) ε_m and deterministic unknown functions f_m(pa_m, ε_m) such that 
V_1 = f_1(ε_1) and the one-step ahead counterfactual V_m^{\bar{v}_{m-1}} ≡ V_m^{pa_m} is given by f_m(pa_m, ε_m). That is, only the parents 
of V_m have a direct effect on V_m relative to the other variables on G. An NPSEM implies that any variable V_j on 
the graph can be intervened on, as counterfactuals in which V_j has been set to a specific value v_j are assumed to 
exist. Both the factual variable V_m and the counterfactuals V_m^r for any R ⊂ V are obtained recursively from V_1 and 
V_j^{\bar{v}_{j-1}}, m ≥ j > 1. For example, V_3^{v_1} = V_3^{v_1, V_2^{v_1}}, i.e., the counterfactual value V_3^{v_1} of V_3 when V_1 is set to v_1 is the 
one-step ahead counterfactual V_3^{v_1, v_2} with v_2 equal to the counterfactual value V_2^{v_1} of V_2. Similarly, V_3 = V_3^{V_1, V_2^{V_1}} 
and V_3^{v_1, v_4} = V_3^{v_1} because V_4 is not a direct cause of V_3. 

Robins (1986) called this NPSEM a finest causally interpreted structural tree graph (FCISTG) “as fine as the 
data”. Pearl (2000) showed how to represent this model with a DAG. Robins (1986) also proposed more realistic 
causally interpreted structural tree graphs in which only a subset of the variables are subject to intervention. For 
expositional purposes, we will assume that every variable can be intervened on, even though the statistical methods 
considered here do not actually require this assumption. 

A FCISTG model does not imply that the causal Markov assumption of Technical Point 6.1 holds; additional 
statistical independence assumptions are needed. For example, Pearl (2000) assumed an NPSEM in which all error 
terms ε_m are mutually independent. We refer to Pearl’s model with independent errors as an NPSEM-IE. In contrast, 
Robins (1986) only assumed that the one-step ahead counterfactuals V_m^{\bar{v}_{m-1}} = f_m(pa_m, ε_m) and V_j^{\bar{v}_{j-1}} = f_j(pa_j, ε_j), 
j < m, are jointly independent when \bar{v}_{j-1} is a subvector of \bar{v}_{m-1}, and referred to this as the finest fully randomized 
causally interpreted structured tree graph (FFRCISTG) model. Robins (1986) showed this assumption implies that the 
causal Markov assumption holds. An NPSEM-IE is an FFRCISTG but not vice-versa because an NPSEM-IE makes 
many more independence assumptions than an FFRCISTG (Robins and Richardson 2011). 

A DAG represents an NPSEM but we need to specify which type. For example, the DAG in Figure 6.2 may 
correspond to either an NPSEM-IE that implies full exchangeability (Y^{a=0}, Y^{a=1}) ⊥⊥ A, or to an FFRCISTG that only 
implies marginal exchangeability Y^a ⊥⊥ A for both a = 0 and a = 1. In this book we assume that DAGs represent 
FFRCISTGs whenever we do not mention the underlying counterfactual model. 

Now let us consider the observational study represented in Figure 6.3. We 
know that carrying a lighter A has no causal effect on lung cancer Y. The 
question now is whether carrying a lighter A is associated with lung cancer Y. 
That is, we know that Pr[Y^{a=1} = 1] = Pr[Y^{a=0} = 1] but is it also true that 
Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0]? To answer this question, imagine that a 
naive investigator decides to study the effect of carrying a lighter A on the risk 
of lung cancer Y (we do know that there is no effect but this is unknown to 
the investigator). He asks a large number of people whether they are carrying 
lighters and then records whether they are diagnosed with lung cancer during 
the next 5 years. Hera is one of the study participants. We learn that Hera 
is carrying a lighter. But if Hera is carrying a lighter (A = 1), then it is 
more likely that she is a smoker (L = 1), and therefore she has a greater than 
average risk of developing lung cancer (Y = 1). We then intuitively conclude 
that A and Y are expected to be associated because the cancer risk in those 
carrying a lighter (A = 1) is different from the cancer risk in those not carrying 
a lighter (A = 0), or Pr[Y = 1|A = 1] ≠ Pr[Y = 1|A = 0]. In other words, 
having information about the treatment A improves our ability to predict the 
outcome Y, even though A does not have a causal effect on Y. The investigator 
will make a mistake if he concludes that A has a causal effect on Y just because 
A and Y are associated. Causal graphs theory again confirms our intuition. In 
graphic terms, A and Y are associated because there is a flow of association 
from A to Y (or, equivalently, from Y to A) through the common cause L. 
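
The naive investigator's study can be mimicked with a small simulation. The sketch below is only an illustration: the function name `simulate_lighter_study` and all probabilities (30% smokers, and so on) are hypothetical values we made up, chosen so that smoking L affects both lighter carrying A and lung cancer Y while A plays no role in generating Y.

```python
import random


def simulate_lighter_study(n=200_000, seed=0):
    """Simulate Figure 6.3 (L -> A and L -> Y, no arrow from A to Y).

    Returns the observed cancer risk among those with A = 1 and A = 0.
    """
    rng = random.Random(seed)
    counts = {0: [0, 0], 1: [0, 0]}  # counts[a] = [individuals, cancer cases]
    for _ in range(n):
        smoker = rng.random() < 0.3                        # L (hypothetical prevalence)
        lighter = rng.random() < (0.8 if smoker else 0.1)  # A depends on L only
        cancer = rng.random() < (0.2 if smoker else 0.02)  # Y depends on L only
        counts[int(lighter)][0] += 1
        counts[int(lighter)][1] += int(cancer)
    return {a: cases / total for a, (total, cases) in counts.items()}


risk = simulate_lighter_study()
print(f"Pr[Y = 1 | A = 1] is approximately {risk[1]:.3f}")
print(f"Pr[Y = 1 | A = 0] is approximately {risk[0]:.3f}")
```

Under these made-up probabilities the risk among lighter carriers is several times the risk among non-carriers, even though the line that generates `cancer` never looks at `lighter`.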

Let us now consider a third example. Suppose you know that a certain genetic 
haplotype A has no causal effect on anyone’s risk of becoming a cigarette 
smoker Y, i.e., Pr[Y^{a=1} = 1] = Pr[Y^{a=0} = 1], and that both the haplotype A 
and cigarette smoking Y have a causal effect on the risk of heart disease L. 
The causal diagram in Figure 6.4 is the graphical translation of this knowledge. 
The lack of an arrow between A and Y indicates that the haplotype does not 
have a causal effect on cigarette smoking, and L is depicted as a common 
effect of A and Y. The common effect L is referred to as a collider on the path 
A → L ← Y because two arrowheads collide on this node. 

Again the question is whether A and Y are associated. To answer this 
question, imagine that another investigator decides to study the effect of hap- 
lotype A on the risk of becoming a cigarette smoker Y (we do know that there 
is no effect but this is unknown to the investigator). She makes genetic de- 
terminations on a large number of children, and then records whether they 
end up becoming smokers. Apollo is one of the study participants. We learn 
that Apollo does not have the haplotype (A = 0). Is he more or less likely 
to become a cigarette smoker (Y = 1) than the average person? Learning 
about the haplotype A does not improve our ability to predict the outcome Y 
because the risk in those with (A = 1) and without (A = 0) the haplotype is 
the same, or Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0]. In other words, we would 
intuitively conclude that A and Y are not associated, i.e., A and Y are independent 
or A ⊥⊥ Y. The knowledge that both A and Y cause heart disease L is 
irrelevant when considering the association between A and Y. Causal graphs 
theory again confirms our intuition because it says that colliders, unlike other 
variables, block the flow of association along the path on which they lie. Thus 
A and Y are independent because the only path between them, A → L ← Y, 
is blocked by the collider L. 
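
The same conclusion can be sketched algebraically. Assuming the joint density of (A, Y, L) satisfies the Markov factorization of Technical Point 6.1 for the diagram A → L ← Y, in which neither A nor Y has parents, summing over the collider L leaves a product of the two marginal densities:

```latex
\begin{aligned}
f(a, y, l) &= f(a)\, f(y)\, f(l \mid a, y), \\
f(a, y) &= \sum_{l} f(a)\, f(y)\, f(l \mid a, y)
         = f(a)\, f(y) \sum_{l} f(l \mid a, y)
         = f(a)\, f(y).
\end{aligned}
```

Because f(a, y) = f(a) f(y), we have Pr[Y = 1|A = 1] = Pr[Y = 1|A = 0] = Pr[Y = 1], whatever the form of f(l | a, y).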

In summary, two variables are (marginally) associated if one causes the 
other, or if they share common causes. Otherwise they will be (marginally) in- 
dependent. The next section explores the conditions under which two variables 
A and Y may be independent conditionally on a third variable L. 


6.3 Causal diagrams and conditional independence 

We now revisit the settings depicted in Figures 6.2, 6.3, and 6.4 to discuss the 
concept of conditional independence in causal diagrams. 

According to Figure 6.2, we expect aspirin A and heart disease Y to be 
associated because aspirin has a causal effect on heart disease. Now suppose 
we obtain an additional piece of information: aspirin A affects the risk of heart 
disease Y because it reduces platelet aggregation B. This new knowledge is 
translated into the causal diagram of Figure 6.5 that shows platelet aggregation 
B (1: high, 0: low) as a mediator of the effect of A on Y. 

Once a third variable is introduced in the causal diagram we can ask a new 
question: is there an association between A and Y within levels of (conditional 
on) B? Or, equivalently: when we already have information on B, does information 
about A improve our ability to predict Y? To answer this question, 
suppose data were collected on A, B, and Y in a large number of individuals, 
and that we restrict the analysis to the subset of individuals with low platelet 
aggregation (B = 0). The square box placed around the node B in Figure 6.5 
represents this restriction. (We would also draw a box around B if the analysis 
were restricted to the subset of individuals with B = 1.) 

Because no conditional independences are expected in complete causal diagrams 
(those in which all possible arrows are present), it is often said that 
information about associations is in the missing arrows. 

Blocking the flow of association between treatment and outcome through the 
common cause is the graph-based justification to use stratification as a 
method to achieve exchangeability. 

Individuals with low platelet aggregation (B = 0) have a lower than average 
risk of heart disease. Now take one of these individuals. Regardless of whether 
the individual was treated (A = 1) or untreated (A = 0), we already knew 
that he has a lower than average risk because of his low platelet aggregation. 
In fact, because aspirin use affects heart disease risk only through platelet 
aggregation, learning an individual’s treatment status does not contribute any 
additional information to predict his risk of heart disease. Thus, in the subset of 
individuals with B = 0, treatment A and outcome Y are not associated. (The 
same informal argument can be made for individuals in the group with B = 1.) 
Even though A and Y are marginally associated, A and Y are conditionally 
independent (unassociated) given B because the risk of heart disease is the 
same in the treated and the untreated within levels of B: Pr[Y = 1|A = 1, B = b] 
= Pr[Y = 1|A = 0, B = b] for all b. That is, A ⊥⊥ Y | B. Graphically, 
we say that a box placed around variable B blocks the flow of association 
through the path A → B → Y. 
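
A small exact calculation illustrates the argument. The sketch below assumes the chain A → B → Y of Figure 6.5 and uses probabilities that are entirely hypothetical (they are not taken from any study); `joint` and `risk` are illustrative names. Exact fractions are used so that equalities hold without rounding error.

```python
from fractions import Fraction as F

# Hypothetical probabilities for the chain A -> B -> Y of Figure 6.5.
P_A1 = F(1, 2)                             # Pr[A = 1], aspirin assigned at random
P_B1_GIVEN_A = {1: F(1, 5), 0: F(3, 5)}    # Pr[B = 1 | A = a], high platelet aggregation
P_Y1_GIVEN_B = {1: F(3, 10), 0: F(1, 10)}  # Pr[Y = 1 | B = b], depends on B only


def joint(a, b, y):
    """Markov factorization f(a) f(b | a) f(y | b) for the chain."""
    pa = P_A1 if a else 1 - P_A1
    pb = P_B1_GIVEN_A[a] if b else 1 - P_B1_GIVEN_A[a]
    py = P_Y1_GIVEN_B[b] if y else 1 - P_Y1_GIVEN_B[b]
    return pa * pb * py


def risk(a, b=None):
    """Pr[Y = 1 | A = a], or Pr[Y = 1 | A = a, B = b] when b is given."""
    levels = (0, 1) if b is None else (b,)
    cases = sum(joint(a, level, 1) for level in levels)
    total = sum(joint(a, level, y) for level in levels for y in (0, 1))
    return cases / total


print("marginal:", risk(1), "vs", risk(0))          # differ: A and Y are associated
print("within B = 0:", risk(1, 0), "vs", risk(0, 0))  # equal: A ⊥⊥ Y | B
print("within B = 1:", risk(1, 1), "vs", risk(0, 1))  # equal: A ⊥⊥ Y | B
```

With these numbers the marginal risks are 7/50 in the treated and 11/50 in the untreated, whereas within each level of B the two risks coincide.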

Let us now return to Figure 6.3. We concluded in the previous section that 
carrying a lighter A was associated with the risk of lung cancer Y because 
the path A ← L → Y was open to the flow of association from A to Y. The 
question we ask now is whether A is associated with Y conditional on L. This 
new question is represented by the box around L in Figure 6.6. Suppose the 
investigator restricts the study to nonsmokers (L = 0). In that case, learning 
that an individual carries a lighter (A = 1) does not help predict his risk of 
lung cancer (Y = 1) because the entire argument for better prediction relied 
on the fact that people carrying lighters are more likely to be smokers. This 
argument is irrelevant when the study is restricted to nonsmokers or, more 
generally, to people who smoke with a particular intensity. Even though A 
and Y are marginally associated, A and Y are conditionally independent given 
L because the risk of lung cancer is the same in the treated and the untreated 
within levels of L: Pr[Y = 1|A = 1, L = l] = Pr[Y = 1|A = 0, L = l] for all 
l. That is, A ⊥⊥ Y | L. Graphically, we say that the flow of association between 
A and Y is interrupted because the path A ← L → Y is blocked by the box 
around L. 

Finally, consider Figure 6.4 again. We concluded in the previous section 
that having the haplotype A was independent of being a cigarette smoker 
Y because the path between A and Y, A → L ← Y, was blocked by the 
collider L. We now argue heuristically that, in general, A and Y will be 
conditionally associated within levels of their common effect L. Suppose that 
the investigators, who are interested in estimating the effect of haplotype A 
on smoking status Y, restricted the study population to individuals with heart 
disease (L = 1). The square around L in Figure 6.7 indicates that they are 
conditioning on a particular value of L. Knowing that an individual with heart 
disease lacks haplotype A provides some information about her smoking status 
because, in the absence of A, it is more likely that another cause of L such 
as Y is present. That is, among people with heart disease, the proportion of 
smokers is increased among those without the haplotype A. Therefore, A and 
Y are inversely associated conditionally on L = 1. The investigators will make a 
mistake if they conclude that A has a causal effect on Y just because A and Y are 
associated within levels of L. In the extreme, if A and Y were the only causes 
of L, then among people with heart disease the absence of one of them would 
perfectly predict the presence of the other. Causal graphs theory shows that 
indeed conditioning on a collider like L opens the path A → L ← Y, which 
was blocked when the collider was not conditioned on. Intuitively, whether 
two variables (the causes) are associated cannot be influenced by an event 
in the future (their effect), but two causes of a given effect generally become 
associated once we stratify on the common effect. 

See Chapter 8 for more on associations due to conditioning on common effects. 

[Figure 6.8: causal diagram with arrows A → L, Y → L, and L → C, with a box around C] 

The mathematical theory underlying the graphical rules is known as 
“d-separation” (Pearl 1995). 

[Figure 6.9: causal diagram with L → A → Y and a boxed selection node S with arrows from L and A] 

As another example, the causal diagram in Figure 6.8 adds to that in Figure 
6.7 a diuretic medication C whose use is a consequence of a diagnosis of heart 
disease. A and Y are also associated within levels of C because C is a common 
effect of A and Y. Causal graphs theory shows that conditioning on a variable 
C affected by a collider L also opens the path A → L ← Y. This path is blocked 
in the absence of conditioning on either the collider L or its consequence C. 
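
The effect of conditioning on a collider, or on a consequence of a collider, can be checked with an exact calculation. The sketch below assumes the structure of Figure 6.8 (A → L ← Y and L → C); every probability in it is a hypothetical value chosen for illustration, and `smoking_risk` is our own helper name.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical probabilities for Figure 6.8: A -> L <- Y and L -> C.
P_A1 = F(1, 4)  # Pr[A = 1], haplotype
P_Y1 = F(1, 3)  # Pr[Y = 1], smoker; does not depend on A


def p_l1(a, y):
    """Pr[L = 1 | A = a, Y = y]: both causes raise the risk of heart disease."""
    return F(1, 20) + F(1, 5) * a + F(1, 4) * y


P_C1_GIVEN_L = {1: F(9, 10), 0: F(1, 10)}  # Pr[C = 1 | L = l], diuretic use


def bern(p1, value):
    """Probability that a binary variable with Pr[1] = p1 equals `value`."""
    return p1 if value == 1 else 1 - p1


# Joint distribution from the Markov factorization f(a) f(y) f(l | a, y) f(c | l).
joint = {
    (a, y, l, c): bern(P_A1, a) * bern(P_Y1, y) * bern(p_l1(a, y), l) * bern(P_C1_GIVEN_L[l], c)
    for a, y, l, c in product((0, 1), repeat=4)
}


def smoking_risk(a, l=None, c=None):
    """Pr[Y = 1 | A = a], optionally also conditioning on L = l and/or C = c."""
    keep = {k: p for k, p in joint.items()
            if k[0] == a and l in (None, k[2]) and c in (None, k[3])}
    return sum(p for k, p in keep.items() if k[1] == 1) / sum(keep.values())


print("no conditioning:", smoking_risk(1), smoking_risk(0))
print("within L = 1:   ", smoking_risk(1, l=1), smoking_risk(0, l=1))
print("within C = 1:   ", smoking_risk(1, c=1), smoking_risk(0, c=1))
```

With these made-up numbers the proportion of smokers is 1/3 regardless of haplotype when nothing is conditioned on, but among individuals with heart disease it is 3/4 in those without the haplotype versus 1/2 in those with it. Restricting to diuretic users (C = 1) produces a weaker association in the same direction.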

This and the previous section review three structural reasons why two vari- 
ables may be associated: one causes the other, they share common causes, or 
they share a common effect and the analysis is restricted to a certain level of 
that common effect (or of its descendants). Along the way we introduced a 
number of graphical rules that can be applied to any causal diagram to deter- 
mine whether two variables are (conditionally) independent. The arguments 
we used to support these graphical rules were heuristic and relied on our causal 
intuitions. These arguments, however, have been formalized and mathemat- 
ically proven. See Fine Point 6.1 for a systematic summary of the graphical 
rules, and Fine Point 6.2 for an introduction to the concept of faithfulness. 

There is another possible source of association between two variables that 
we have not discussed yet: chance or random variability. Unlike the structural 
reasons for an association between two variables—causal effect of one on the 
other, shared common causes, conditioning on common effects—random vari- 
ability results in chance associations that become smaller when the size of the 
study population increases. 

To focus our discussion on structural associations rather than chance asso- 
ciations, we continue to assume until Chapter 10 that we have recorded data on 
every individual in a very large (perhaps hypothetical) population of interest. 


6.4 Positivity and consistency in causal diagrams 


Pearl (2009) reviews quantitative 
methods for causal inference that 
are derived from graph theory. 


Because causal diagrams encode our qualitative expert knowledge about the 
causal structure, they can be used as a visual aid to help conceptualize causal 
problems and guide data analyses. In fact, the formulas that we described in 
Chapter 2 to quantify treatment effects—standardization and IP weighting— 
can also be derived using causal graphs theory, as part of what is sometimes 
referred to as the do-calculus. Therefore, our choice of counterfactual theory 
in Chapters 1-5 did not really privilege one particular approach but only one 
particular notation. 
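
As a reminder of what those two formulas look like, the sketch below writes them for the setting of Figure 6.1 with a discrete L and a dichotomous outcome Y, assuming conditional exchangeability given L, positivity, and consistency. The short derivation showing that they agree is ours and is included only for illustration.

```latex
\begin{aligned}
\Pr[Y^{a} = 1]
  &= \sum_{l} \Pr[Y = 1 \mid A = a, L = l]\, \Pr[L = l]
  && \text{(standardization)} \\
  &= \sum_{l} \frac{\Pr[Y = 1 \mid A = a, L = l]\, \Pr[A = a \mid L = l]\, \Pr[L = l]}{\Pr[A = a \mid L = l]} \\
  &= \mathrm{E}\!\left[ \frac{I(A = a)\, Y}{f(A \mid L)} \right]
  && \text{(IP weighting)}
\end{aligned}
```

The middle line multiplies and divides by Pr[A = a | L = l], which is only possible when that probability is positive in every stratum of L that occurs in the population.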

Regardless of the notation used (counterfactuals or graphs), exchangeabil- 
ity, positivity, and consistency are conditions required for causal inference via 
standardization or IP weighting. If any of these conditions does not hold, the 
numbers arising from the data analysis may not be appropriately interpreted 
as measures of causal effect. In the next section (and in Chapters 7 and 8) we 
discuss how the exchangeability condition is translated into graph language. 
Fine Point 6.1 

D-separation. We define a path to be either blocked or open according to the following graphical rules. 


1. If there are no variables being conditioned on, a path is blocked if and only if two arrowheads on the path collide 
at some variable on the path. In Figure 6.1, the path L → A → Y is open, whereas the path A → Y ← L is 
blocked because two arrowheads on the path collide at Y. We call Y a collider on the path A → Y ← L. 


2. Any path that contains a non-collider that has been conditioned on is blocked. In Figure 6.5, the path between 
A and Y is blocked after conditioning on B. We use a square box around a variable to indicate that we are 
conditioning on it. 


3. A collider that has been conditioned on does not block a path. In Figure 6.7, the path between A and Y is open 
after conditioning on L. 


4. A collider that has a descendant that has been conditioned on does not block a path. In Figure 6.8, the path 
between A and Y is open after conditioning on C, a descendant of the collider L. 


Rules 1-4 can be summarized as follows. A path is blocked if and only if it contains a non-collider that has been 
conditioned on, or it contains a collider that has not been conditioned on and has no descendants that have been 
conditioned on. Two variables are d-separated if all paths between them are blocked (otherwise they are d-connected). 
Two sets of variables are d-separated if each variable in the first set is d-separated from every variable in the second set. 
Thus, A and L are not d-separated in Figure 6.1 because there is one open path between them (L → A), despite the 
other path (A → Y ← L) being blocked by the collider Y. In Figure 6.4, however, A and Y are d-separated because 
the only path between them is blocked by the collider L. 

The relationship between statistical independence and the purely graphical concept of d-separation relies on the 
causal Markov assumption (Technical Point 6.1): In a causal DAG, any variable is independent of its non-descendants 
conditional on its parents. Pearl (1988) proved the following fundamental theorem: The causal Markov assumption 
implies that, given any three disjoint sets A, B, C of variables, if A is d-separated from B conditional on C, then A 
is statistically independent of B given C. The assumption that the converse holds, i.e., that A is d-separated from B 
conditional on C if A is statistically independent of B given C, is a separate assumption—the faithfulness assumption 
described in Fine Point 6.2. Under faithfulness, A is conditionally independent of Y given B in Figure 6.5, A is not 
conditionally independent of Y given L in Figure 6.7, and A is not conditionally independent of Y given C in Figure 
6.8. The d-separation rules (‘d-’ stands for directional) to infer associational statements from causal diagrams were 
formalized by Pearl (1995). An equivalent set of graphical rules, known as “moralization”, was developed by Lauritzen 
et al. (1990). 
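
Rules 1-4 are mechanical enough to be written as a short program. The following is a minimal sketch rather than an efficient or standard implementation: it enumerates every path between two variables and applies the summary rule stated above. The function names and the encoding of a graph as a set of (cause, effect) pairs are assumptions of ours.

```python
def all_paths(edges, start, end):
    """Yield every path between start and end, ignoring arrow direction.

    A path visits no variable more than once.
    """
    neighbours = {}
    for v, w in edges:
        neighbours.setdefault(v, set()).add(w)
        neighbours.setdefault(w, set()).add(v)

    def extend(path):
        if path[-1] == end:
            yield path
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([start])


def descendants(edges, node):
    """Return every variable reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for v, w in edges:
            if v == current and w not in found:
                found.add(w)
                stack.append(w)
    return found


def is_blocked(edges, path, conditioned):
    """Apply the summary of rules 1-4 to a single path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            if mid not in conditioned and not (descendants(edges, mid) & conditioned):
                return True
        elif mid in conditioned:
            # A non-collider that has been conditioned on blocks the path.
            return True
    return False


def d_separated(edges, x, y, conditioned=()):
    """Two variables are d-separated if all paths between them are blocked."""
    conditioned = set(conditioned)
    return all(is_blocked(edges, p, conditioned) for p in all_paths(edges, x, y))


# Diagrams from this chapter, written as (cause, effect) pairs.
fig_6_1 = {("L", "A"), ("L", "Y"), ("A", "Y")}
fig_6_4 = {("A", "L"), ("Y", "L")}            # Figure 6.7 is Figure 6.4 with a box around L
fig_6_5 = {("A", "B"), ("B", "Y")}
fig_6_8 = {("A", "L"), ("Y", "L"), ("L", "C")}

print(d_separated(fig_6_4, "A", "Y"))         # collider L not conditioned on
print(d_separated(fig_6_4, "A", "Y", {"L"}))  # conditioning on the collider
print(d_separated(fig_6_5, "A", "Y", {"B"}))  # conditioning on the mediator
print(d_separated(fig_6_8, "A", "Y", {"C"}))  # conditioning on a descendant of the collider
```

Enumerating all paths is exponential in the worst case, so this sketch is only suitable for small diagrams such as the ones in this chapter.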





Here we focus on positivity and consistency. 
Positivity is roughly translated into graph language as the condition that 
the arrows from the nodes L to the treatment node A are not deterministic. 
The first component of consistency, well-defined interventions, means that 
the arrow from treatment A to outcome Y corresponds to a possibly hypothetical 
but relatively unambiguous intervention. In the causal diagrams discussed 
in this book, positivity is implicit unless otherwise specified, and consistency 
is embedded in the notation because we only consider treatment nodes with 
relatively well-defined interventions. Positivity is concerned with arrows into 
the treatment nodes, and well-defined interventions are only concerned with 
arrows leaving the treatment nodes. 

A more precise discussion of positivity in causal graphs is given by 
Richardson and Robins (2013). 
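
A deterministic arrow from L to A would show up in data as a level of L in which some treatment value never occurs. The sketch below looks for such empty cells; the function name `positivity_violations` and the four records are hypothetical. An empty cell in a finite sample may of course arise by chance rather than by structure, so this check can only flag candidates for a positivity violation.

```python
from collections import Counter


def positivity_violations(records, treatment_levels=(0, 1)):
    """Return the (l, a) pairs for which no individual with L = l received A = a."""
    seen = Counter(records)
    strata = {l for l, _ in records}
    return sorted((l, a) for l in strata for a in treatment_levels if seen[(l, a)] == 0)


# Hypothetical data: one (disease severity L, heart transplant A) pair per individual.
records = [(0, 0), (0, 1), (1, 1), (1, 1)]
violations = positivity_violations(records)
print(violations)  # strata of L in which a treatment level is never observed
```

Here every individual with L = 1 was treated, so the pair (L = 1, A = 0) is reported.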

Thus, the treatment nodes are implicitly given a different status compared 
with all other nodes. Some authors make this difference explicit by including 
decision nodes in causal diagrams. Though this decision-theoretic approach 
largely leads to the same methods described here, we do not include decision 
Fine Point 6.2 


Faithfulness. In a causal DAG the absence of an arrow from A to Y indicates that the sharp null hypothesis of no 
causal effect of A on any individual's Y holds, and an arrow A → Y (as in Figure 6.2) indicates that A has a causal 
effect on the outcome Y of at least one individual in the population. Thus, we would generally expect that, under 
Figure 6.2, the average causal effect of A on Y, Pr[Y^{a=1} = 1] ≠ Pr[Y^{a=0} = 1], and the association between A and 
Y, Pr[Y = 1|A = 1] ≠ Pr[Y = 1|A = 0], are not null. However, that is not necessarily true: a setting represented by 
Figure 6.2 may be one in which there is neither an average causal effect nor an association. For an example, remember 
the data in Table 4.1. Heart transplant A increases the risk of death Y in women (half of the population) and decreases 
the risk of death in men (the other half). Because the beneficial and harmful effects of A perfectly cancel out, the 
average causal effect is null, Pr[Y^{a=1} = 1] = Pr[Y^{a=0} = 1]. Yet Figure 6.2 is the correct causal diagram because 
treatment A affects the outcome Y of some individuals (in fact, of all individuals) in the population. 

Formally, faithfulness is the assumption that, for three disjoint sets A, B, C on a causal DAG (where C may be the 
empty set), A independent of B given C implies A is d-separated from B given C. When, as in our example, the causal 
diagram makes us expect a non-null association that does not actually exist in the data, we say that the joint distribution 
of the data is not faithful to the causal DAG. In our example the unfaithfulness was the result of effect modification 
(by sex) with opposite effects of exactly equal magnitude in each half of the population. Such perfect cancellation of 
effects is rare, and thus we will assume faithfulness throughout this book. Because unfaithful distributions are rare, in 
practice lack of d-separation (see Fine Point 6.1) can be equated to non-zero association. 

There are, however, instances in which faithfulness is violated by design. For example, consider the prospective 
study in Section 4.5. The average causal effect of A on Y was computed after matching on L. In the matched 
population, L and A are not associated because the distribution of L is the same in the treated and the untreated. 
That is, individuals are selected into the matched population because they have a particular combination of values of 
L and A. The causal diagram in Figure 6.9 represents the setting of a matched study in which selection S (1: yes, 
0: no) is determined by both A and L. The box around S indicates that the analysis is restricted to those selected 
into the matched cohort (S = 1). According to d-separation rules, there are two open paths between A and L when 
conditioning on S: L → A and L → S ← A. Thus one would expect L and A to be associated conditionally on S. 
However, matching ensures that L and A are not associated (see Chapter 4). Why the discrepancy? Matching creates 
an association via the path L → S ← A that is of equal magnitude, but opposite in direction, to the association via the 
path L → A. The net result is a perfect cancellation of the associations. Matching leads to unfaithfulness. 

Finally, faithfulness may be violated when there exist deterministic relations between variables on the graph. Specif- 
ically, when two variables are linked by paths that include deterministic arrows, then the two variables are independent 
if all paths between them are blocked, but might also be independent even if some paths are open. In this book we 
will assume faithfulness unless we say otherwise. Faithfulness is also assumed when the goal of the data analysis is 
discovering the causal structure (see Fine Point 6.3). 








nodes in the causal diagrams presented in this chapter. Because we are always 
explicit about the potential interventions on the variable A, the additional 
nodes (to represent the potential interventions) would be somewhat redundant. 
However, we will give a different status to treatment nodes when using 
SWIGs (causal diagrams with nodes representing counterfactual variables) in 
subsequent chapters. 

Influence diagrams are causal diagrams augmented with decision 
nodes to represent the interventions of interest (Dawid 2000, 2002). 


The different status of treatment nodes compared with other nodes was also 
graphically explicit in the causal trees introduced in Chapter 2, in which non- 
treatment branches corresponding to non-treatment variables L and Y were 
enclosed in circles, and in the “pies” representing sufficient causes in Chapter 
5, which distinguish between potential treatments A and E and background 
factors U. Also, our discussion on well-defined versions of treatment in Chapter 
3 emphasizes the requirements imposed on the treatment variables A that do 
not apply to other variables. 
In contrast, the causal diagrams in this chapter apparently assign the same 
status to all variables in the diagram—this is indeed the case when causal dia- 
grams are considered as representations of nonparametric structural equations 
models (see Technical Point 6.2). The apparently equal status of all variables 
in causal diagrams may be misleading, especially when some of those variables 
are ill-defined. It may be okay to draw a causal diagram that includes a node 
for “obesity” as the outcome Y or even as a covariate L. However, for the rea- 
sons discussed in Chapter 3, it is generally not okay to draw a causal diagram 
that includes a node for “obesity” as a treatment A. In causal diagrams, nodes 
for treatment variables with multiple relevant versions need to be sufficiently 
well-defined. 

For example, suppose that we are interested in the causal effect of the com- 
pound treatment R, where R = 1 is defined as “exercising at least 30 minutes 
daily,” and R = 0 is defined as “exercising less than 30 minutes daily.” Individuals 
who exercise 30 minutes or longer will be classified as R = 1, and thus 
each of the possible durations 30, 31, 32, ... minutes can be viewed as a different 
version of the treatment R = 1. For each individual with R = 1 in the study, 
the versions of treatment A(r = 1) can take values 30, 31, 32, ..., indicating all 
possible durations of exercise greater than or equal to 30 minutes. For each individual 
with R = 0 in the study, A(r = 0) can take values 0, 1, 2, ..., 29, including 
all durations of less than 30 minutes. That is, per the definition of compound 
treatment, multiple values a(r) can be mapped onto a single value R = r. 
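
The many-to-one mapping from versions a(r) to the compound treatment R can be sketched in a few lines. The function name `compound_treatment` and the list of durations are hypothetical and serve only to illustrate the definition given above.

```python
from collections import defaultdict


def compound_treatment(minutes):
    """Map a version of treatment (daily minutes of exercise) to R."""
    return 1 if minutes >= 30 else 0


# Hypothetical durations of daily exercise for six individuals.
durations = [0, 12, 29, 30, 31, 45]

# Group the observed versions by the value of R they map onto.
versions_by_r = defaultdict(list)
for a in durations:
    versions_by_r[compound_treatment(a)].append(a)

print(dict(versions_by_r))  # {0: [0, 12, 29], 1: [30, 31, 45]}
```

Several distinct versions share the same value of R, so knowing R alone does not tell us which version an individual received.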

Figure 6.10 shows how a causal diagram can appropriately depict a compound 
treatment R. The causal diagram also includes nodes for the treatment 
versions A (a vector including all the variables A(r)), two sets of common 
causes L and W, and unmeasured variables U. Unlike other causal diagrams 
described in this chapter, the one in Figure 6.10 includes nodes (R and A) that 
are deterministically related. The multiple versions A are sufficiently specified 
when, as in Figure 6.10, there are no direct arrows from R to Y. 

Being explicit about the compound treatment R of interest and its ver- 
sions A(r) is an important step towards having a well-defined causal effect, 
identifying relevant data, and choosing adjustment variables. 


6.5 A structural classification of bias 


The word “bias” is frequently used by investigators making causal inferences. 
There are several related, but technically different, uses of the term “bias” (see 
Chapter 10). We say that there is systematic bias when the data are insufficient 
to identify—compute—the causal effect even with an infinite sample size. (In 
this chapter, due to the assumption of an infinite sample size, bias refers to 
systematic bias.) Informally, we often refer to systematic bias as any structural 
association between treatment and outcome that does not arise from the causal 
effect of treatment on outcome in the population of interest. Because causal 
diagrams are helpful to represent different sources of association, we can use 
causal diagrams to classify systematic bias according to its source, and thus to 
sharpen discussions about bias. 

Take the crucial source of bias that we have discussed in previous chapters: 
lack of exchangeability between the treated and the untreated. For the average 
causal effect in the entire population, we say that there is (unconditional) bias 
when Pr[Y^{a=1} = 1] − Pr[Y^{a=0} = 1] ≠ Pr[Y = 1|A = 1] − Pr[Y = 1|A = 0], 
which is the case when (unconditional) exchangeability Y^a ⊥⊥ A does not hold. 
Fine Point 6.3 


Discovery of causal structure. In this book we use causal diagrams as a way to represent our expert knowledge—or 
assumptions—about the causal structure of the problem at hand. That is, the causal diagram guides the data analysis. 
How about going in the opposite direction? Can we learn the causal structure by conducting data analyses without 
making assumptions about the causal structure? The process of learning components of the causal structure through 
data analysis is referred to as discovery (Spirtes et al., 2000). 

We now briefly discuss causal discovery under the assumption that the observed data arose from an unknown 
causal DAG that includes, in addition to the observed variables, an unknown number of unobserved variables U. Causal 
discovery requires that we assume faithfulness so that statistical independencies in the observed data distribution imply 
missing causal arrows on the DAG. Even assuming faithfulness, discovery is often impossible. For example, suppose 
that we find a strong association between two variables B and C in our data. We cannot learn much about the causal 
structure involving B and C because their association is consistent with many causal diagrams: B causes C (B → C), 
C causes B (C → B), B and C share an unmeasured cause U (B ← U → C), B and C have an unobserved common 
effect U that has been conditioned on, and various combinations. If we knew the time sequence of B and C, we could 
only rule out causal diagrams with either B → C (if C predates B) or C → B (if B predates C). 

There are, however, some settings in which learning causal structure from data appears possible. Suppose we 
have an infinite amount of data on 3 variables Z, A, Y and we know that their time sequence is Z first, A second, 
and Y last. Our data analysis finds that all 3 variables are marginally associated with each other, and that the only 
conditional independence that holds is Z ⊥⊥ Y | A. Then, if we are willing to assume that faithfulness holds, the only 
possible causal diagram consistent with our analysis is Z → A → Y with perhaps a common cause U of Z and A in 
addition to (or in place of) the arrow from Z to A. This is because, if either Z was a parent of Y or shared a cause 
with Y, or an unmeasured common cause of A and Y was present, then Z and Y could not have been statistically 
independent given A (assuming faithfulness). Thus, to explain the marginal dependency of Y and A, there must be a 
causal arrow from A to Y. In summary, the causal DAG learned implies that Z is not a direct cause (parent) of Y, that 
no unmeasured common cause of A and Y exists, and that, in fact, the average causal effect of A on Y is identified by 
E[Y|A = 1] − E[Y|A = 0]. 
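
The pattern of (in)dependencies in this example can be reproduced exactly from a joint distribution generated by the chain Z → A → Y. The sketch below assumes binary variables and hypothetical conditional probabilities; the helper names `bern`, `prob`, and `risk_y` are illustrative only. It works with the population distribution directly, which corresponds to the infinite-sample setting of the example.

```python
from itertools import product

# Hypothetical conditional probabilities for a chain Z -> A -> Y (all binary).
p_z1 = 0.5
p_a1_given_z = {0: 0.2, 1: 0.7}  # Pr[A = 1 | Z = z]
p_y1_given_a = {0: 0.1, 1: 0.4}  # Pr[Y = 1 | A = a]; Y depends on A only


def bern(p1, value):
    """Probability that a binary variable with Pr[1] = p1 equals value."""
    return p1 if value == 1 else 1 - p1


# Joint distribution Pr[Z = z, A = a, Y = y] implied by the chain.
joint = {
    (z, a, y): bern(p_z1, z) * bern(p_a1_given_z[z], a) * bern(p_y1_given_a[a], y)
    for z, a, y in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Probability of the event that fixes some of z, a, y."""
    names = ("z", "a", "y")
    return sum(
        p for key, p in joint.items()
        if all(key[names.index(k)] == v for k, v in fixed.items())
    )


def risk_y(**given):
    """Pr[Y = 1 | given]."""
    return prob(y=1, **given) / prob(**given)


# Z and Y are marginally associated (the difference is non-zero) ...
print("Z-Y marginal:", risk_y(z=1) - risk_y(z=0))
# ... but independent given A (the differences are zero up to rounding).
for a in (0, 1):
    print("Z-Y given A =", a, ":", risk_y(z=1, a=a) - risk_y(z=0, a=a))
# Under this DAG the associational difference equals the effect of A on Y.
print("A-Y:", risk_y(a=1) - risk_y(a=0))
```

The code goes from a known DAG to its independencies. Discovery attempts the reverse step, which is why it needs the faithfulness assumption.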

The problem is, of course, that we do not have an infinite sample size. Robins et al. (2003) showed that, due 
to sampling variability, there is no finite sample size at which results of independence tests can, with high probability, 
distinguish between the hypotheses “A is a cause of Y” and “A does not cause Y”. Therefore, if we impose no 
assumption beyond faithfulness on the unknown graph, we can never have confidence that we have discovered the 
presence or absence of a causal effect from data. See the book by Peters et al. (2017) for alternative approaches to 
causal discovery. 





Absence of (unconditional) bias implies that the association measure (e.g., 
associational risk ratio or difference) in the population is a consistent estimate 
of the corresponding effect measure (e.g., causal risk ratio or difference) in the 
population. 

When there is systematic bias, no estimator can be consistent. Review 
Chapter 1 for a definition of consistent estimator. 

Lack of exchangeability results in bias even when the null hypothesis of no 
causal effect of treatment on the outcome holds. That is, even if the treatment 
had no causal effect on the outcome, treatment and outcome would be associated 
in the data. We then say that lack of exchangeability leads to bias under 
the null. In the observational study summarized in Table 3.1, there was bias 
under the null because the causal risk ratio was 1 whereas the associational 
risk ratio was 1.26. Any causal structure that results in bias under the null will 
also cause bias under the alternative (i.e., when treatment does have a non-null 
effect on the outcome). However, the converse is not true. 

For example, conditioning on some variables may cause bias under the 
alternative (i.e., off the null) but not under the null, as described 
by Greenland (1977) and Hernán (2017). See also Chapter 18. 

For the average causal effects within levels of L, we say that there is 
conditional bias whenever Pr[Y^{a=1} = 1|L = l] − Pr[Y^{a=0} = 1|L = l] differs from 
Pr[Y = 1|L = l, A = 1] − Pr[Y = 1|L = l, A = 0] for at least one stratum 
l, which is generally the case when conditional exchangeability Y^a ⊥⊥ A | L = l 
does not hold for all a and l. 

Another form of bias may also result from (nonstructural) random 
variability. See Chapter 10. 

So far in this book we have referred to lack of exchangeability multiple 
times. However, we have yet to explore the causal structures that generate 
lack of exchangeability. With causal diagrams added to our methodological 
arsenal, we will be able to describe how lack of exchangeability can result from 
two different causal structures: 


1. Common causes: When the treatment and outcome share a common 
cause, the association measure generally differs from the effect measure. 
Many epidemiologists use the term confounding to refer to this bias. 


2. Conditioning on common effects: This structure is the source of bias that 
many epidemiologists refer to as selection bias. 


Chapter 7 will focus on confounding bias due to the presence of common 
causes, and Chapter 8 on selection bias due to conditioning on common effects. 
Again, both are examples of bias under the null due to lack of exchangeability. 

Chapter 9 will focus on another source of bias: measurement error. So far 
we have assumed that all variables (treatment A, outcome Y, and covariates 
L) are perfectly measured. In practice, however, some degree of measurement 
error is expected. The bias due to measurement error is referred to as mea- 
surement bias or information bias. As we will see, some types of measurement 
bias also cause bias under the null. 

Therefore, in the next three chapters we turn our attention to the three 
types of systematic bias—confounding, selection, and measurement. These bi- 
ases may arise both in observational studies and in randomized experiments. 
The susceptibility to bias of randomized experiments may not be obvious from 
previous chapters, in which we conceptualized observational studies as some 
sort of imperfect randomized experiments, while only considering ideal random- 
ized experiments with no participants lost during the follow-up, all participants 
adhering to their assigned treatment, and unknown treatment assignment for 
both study participants and investigators. While our quasi-mythological char- 
acterization of randomized experiments was helpful for teaching purposes, real 
randomized experiments rarely look like that. The remaining chapters of Part 
I will elaborate on the sometimes fuzzy boundary between experimenting and 
observing. 

Before that, we take a brief detour to describe causal diagrams in the 
presence of effect modification. 


6.6 The structure of effect modification 


Figure 6.11 (causal diagram with arrows A → Y and V → Y) 


Identifying potential sources of bias is a key use of causal diagrams: we can 
use our causal expert knowledge to draw graphs and then search for sources of 
association between treatment and outcome. Causal diagrams are less helpful 
to illustrate the concept of effect modification that we discussed in Chapter 4. 

Suppose heart transplant A was randomly assigned in an experiment to 
identify the average causal effect of A on death Y. For simplicity, let us assume 
that there is no bias, and thus Figure 6.2 adequately represents this study. 
Computing the effect of A on the risk of Y presents no challenge. Because 
association is causation, the associational risk difference Pr[Y = 1|A = 1] − 
Pr[Y = 1|A = 0] can be interpreted as the causal risk difference Pr[Y^{a=1} = 1] − 
Pr[Y^{a=0} = 1]. The investigators, however, want to go further because they 
suspect that the causal effect of heart transplant varies by the quality of medical 
care offered in each hospital participating in the study. Thus, the investigators 
classify all individuals as receiving high (V = 1) or normal (V = 0) quality of 
care, compute the stratified risk differences in each level of V as described in 
Chapter 4, and indeed confirm that there is effect modification by V on the 
additive scale. The causal diagram in Figure 6.11 includes the effect modifier 
V with an arrow into the outcome Y but no arrow into treatment A (which is 
randomly assigned and thus independent of V). Two important caveats apply. 

Figure 6.12 
Figure 6.13 
Figure 6.14 
Figure 6.15 

See VanderWeele and Robins (2007b) for a finer classification 
of effect modification via causal diagrams. 

First, the causal diagram in Figure 6.11 would still be a valid causal diagram 
if it did not include V because V is not a common cause of A and Y. It is 
only because the causal question makes reference to V (i.e., what is the average 
causal effect of A on Y within levels of V?), that V needs to be included on the 
causal diagram. Other variables measured along the path between “quality of 
care” V and the outcome Y could also qualify as effect modifiers. For example, 
Figure 6.12 shows the effect modifier “therapy complications” N, which partly 
mediates the effect of V on Y. 

Second, the causal diagram in Figure 6.11 does not necessarily indicate the 
presence of effect modification by V. The causal diagram implies that both A 
and V affect death Y, but it does not distinguish among the following three 
qualitatively distinct ways that V could modify the effect of A on Y: 


1. The causal effect of treatment A on mortality Y is in the same direction 
(i.e., harmful or beneficial) in both stratum V = 1 and stratum V = 0. 


2. The direction of the causal effect of treatment A on mortality Y in stra- 
tum V = 1 is the opposite of that in stratum V = 0 (i.e., there is 
qualitative effect modification). 


3. Treatment A has a causal effect on Y in one stratum of V but no causal 
effect in the other stratum, e.g., A only kills individuals with V = 0. 


That is, valid causal graphs such as Figure 6.11 fail to distinguish between 
the above three different qualitative types of effect modification by V. 

In the above example, the effect modifier V had a causal effect on the 
outcome. Many effect modifiers, however, do not have a causal effect on the 
outcome. Rather, they are surrogates for variables that have a causal effect 
on the outcome. Figure 6.13 includes the variable “cost of the treatment” 
S (1: high, 0: low), which is affected by “quality of care” V but has itself 
no effect on mortality Y. An analysis stratified by S (but not by V) will 
generally detect effect modification by S even though the variable that truly 
modifies the effect of A on Y is V. The variable S is a surrogate effect modifier 
whereas the variable V is a causal effect modifier (see Section 4.2). Because 
causal and surrogate effect modifiers are often indistinguishable in practice, 
the concept of effect modification comprises both. As discussed in Section 4.2, 
some prefer to use the neutral term “heterogeneity of causal effects,” rather 
than “effect modification,” to avoid confusion. For example, someone might be 
tempted to interpret the statement “cost modifies the effect of heart transplant 
on mortality because the effect is more beneficial when the cost is higher” as an 
argument to increase the price of medical care without necessarily increasing 
its quality. 
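
A small exact calculation illustrates how a surrogate such as S can show effect modification without affecting the outcome. The sketch assumes the structure of Figure 6.13 as described above (V affects S and Y, A is randomized, S has no effect on Y). All probabilities and function names are hypothetical.

```python
# Hypothetical parameters for the structure described for Figure 6.13.
p_v1 = 0.5                       # Pr[V = 1], high quality of care
p_s1_given_v = {1: 0.8, 0: 0.3}  # Pr[S = 1 | V = v], high cost
# Pr[Y^{a} = 1 | V = v], keyed by (a, v): the risk depends on A and V only.
risk = {
    (1, 1): 0.10, (0, 1): 0.30,  # transplant beneficial under high quality
    (1, 0): 0.30, (0, 0): 0.30,  # no effect under normal quality
}


def p_v_given_s(v, s):
    """Pr[V = v | S = s] by Bayes' rule."""
    def p_s(s_, v_):
        return p_s1_given_v[v_] if s_ == 1 else 1 - p_s1_given_v[v_]

    def p_v(v_):
        return p_v1 if v_ == 1 else 1 - p_v1

    return p_s(s, v) * p_v(v) / sum(p_s(s, u) * p_v(u) for u in (0, 1))


def effect_within_v(v):
    """Causal risk difference in stratum V = v."""
    return risk[(1, v)] - risk[(0, v)]


def effect_within_s(s):
    """Causal risk difference in stratum S = s: average over V given S = s."""
    return sum(p_v_given_s(v, s) * effect_within_v(v) for v in (0, 1))


for v in (1, 0):
    print("V =", v, "risk difference:", effect_within_v(v))
for s in (1, 0):
    print("S =", s, "risk difference:", effect_within_s(s))
```

The risk difference differs between S = 1 and S = 0 only because the two cost strata contain different proportions of high-quality hospitals. Changing S while holding V fixed would leave every risk unchanged.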

A surrogate effect modifier is simply a variable associated with the causal 
effect modifier. Figure 6.13 depicts the setting in which such association is 
due to the effect of the causal effect modifier on the surrogate effect modifier. 
However, such association may also be due to shared common causes or 
conditioning on common effects. For example, Figure 6.14 includes the variables 
“place of residence” (1: Greece, 0: Rome) U and “passport-defined nationality” 
P (1: Greece, 0: Rome). Place of residence U is a common cause of both 
quality of care V and nationality P. Thus P will behave as a surrogate effect 
modifier because P is associated with the causal effect modifier V. Another 
(admittedly silly) example to illustrate this issue: Figure 6.15 includes the 
variables “cost of care” S and “use of bottled mineral water (rather than tap 
water) for drinking at the hospital” W. Use of mineral water W affects cost 
S but not mortality Y in developed countries. If the study were restricted to 
low-cost hospitals (S = 0), then use of mineral water W would be generally 
associated with medical care V, and thus W would behave as a surrogate effect 
modifier. In summary, surrogate effect modifiers can be associated with the 
causal effect modifier by structures including common causes, conditioning on 
common effects, or cause and effect. 

Some intuition for the association between W and V in low-cost hospitals 
S = 0: suppose that low-cost hospitals that use mineral water need to offset 
the extra cost of mineral water by spending less on components of medical care 
that decrease mortality. Then use of mineral water would be inversely 
associated with quality of medical care in low-cost hospitals. 

Causal diagrams are in principle agnostic about the presence of interaction 
between two treatments A and E. However, causal diagrams can encode infor- 
mation about interaction when augmented with nodes that represent sufficient- 
component causes (see Chapter 5), i.e., nodes with deterministic arrows from 
the treatments to the sufficient-component causes. Because the presence of 
interaction affects the magnitude and direction of the association due to con- 
ditioning on common effects, these augmented causal diagrams are discussed 
in Chapter 8. 


Chapter 7 
CONFOUNDING 


Suppose an investigator conducted an observational study to answer the causal question “does one’s looking up to 
the sky make other pedestrians look up too?” She found an association between a first pedestrian’s looking up and 
a second one’s looking up. However, she also found that pedestrians tend to look up when they hear a thunderous 
noise above. Thus it was unclear what was making the second pedestrian look up, the first pedestrian’s looking 
up or the thunderous noise? She concluded the effect of one’s looking up was confounded by the presence of a 
thunderous noise. 

In randomized experiments treatment is assigned by the flip of a coin, but in observational studies treatment 
(e.g., a person’s looking up) may be determined by many factors (e.g., a thunderous noise). If those factors affect 
the risk of developing the outcome (e.g., another person’s looking up), then the effects of those factors become 
entangled with the effect of treatment. We then say that there is confounding, which is just a form of lack of 
exchangeability between the treated and the untreated. Confounding is often viewed as the main shortcoming of 
observational studies. In the presence of confounding, the old adage “association is not causation” holds even if the 
study population is arbitrarily large. This chapter provides a definition of confounding and reviews the methods 
to adjust for it. 


7.1 The structure of confounding 


The structure of confounding, the bias due to common causes of treatment 
and outcome, can be represented by using causal diagrams. For example, the 
diagram in Figure 7.1 (same as Figure 6.1) depicts a treatment A, an outcome 
Y, and their shared (or common) cause L. This diagram shows two sources 
of association between treatment and outcome: 1) the path A → Y that 
represents the causal effect of A on Y, and 2) the path A ← L → Y between 
A and Y that includes the common cause L. The path A ← L → Y that links 
A and Y through their common cause L is an example of a backdoor path. 

Figure 7.1 (causal diagram with arrows L → A, L → Y, and A → Y) 

In a causal DAG, a backdoor path is a noncausal path between treatment 
and outcome that remains even if all arrows pointing from treatment to other 
variables (the descendants of treatment) are removed. That is, the path has an 
arrow pointing into treatment. 

If the common cause L did not exist in Figure 7.1, then the only path 
between treatment and outcome would be A → Y, and thus the entire association 
between A and Y would be due to the causal effect of A on Y. That 
is, the associational risk ratio Pr[Y = 1|A = 1] / Pr[Y = 1|A = 0] would equal 
the causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1]; association would be 
causation. But the presence of the common cause L creates an additional source of 
association between the treatment A and the outcome Y, which we refer to as 
confounding for the effect of A on Y. Because of confounding, the associational 
risk ratio does not equal the causal risk ratio; association is not causation. 

Figure 7.2 (causal diagram with nodes L, A, Y, and unmeasured U) 

Some authors prefer to replace the unmeasured common cause U (and 
the two arrows leaving it) by a bidirectional edge between the measured 
variables that U causes. 

Figure 7.3 

Early statistical descriptions of confounding were provided by Yule 
(1903) for discrete variables and by Pearson et al. (1899) for continuous 
variables. Yule described the association due to confounding as 
“fictitious”, “illusory”, and “apparent”. Pearson et al. (1899) referred 
to it as a “spurious” correlation. However, there is nothing 
fictitious, illusory, apparent, or spurious about these associations. 
Associations due to common causes are quite real associations, though 
they cannot be causally interpreted as treatment effects. Or, in Yule’s 
words, they are associations “to which the most obvious physical 
meaning must not be assigned.” 

Examples of confounding abound in observational research. Consider the 
following examples of confounding for the effect of various kinds of treatments 
on health outcomes: 


• Occupational factors: The effect of working as a firefighter A on the risk 
of death Y will be confounded if “being physically fit” L is a cause of 
both being an active firefighter and having a lower mortality risk. This 
bias, depicted in the causal diagram in Figure 7.1, is often referred to as 
a healthy worker bias. 


• Clinical decisions: The effect of drug A (say, aspirin) on the risk of 
disease Y (say, stroke) will be confounded if the drug is more likely to 
be prescribed to individuals with certain condition L (say, heart disease) 
that is both an indication for treatment and a risk factor for the disease. 
Heart disease L is a risk factor for stroke Y because L has a direct causal 
effect on Y as in Figure 7.1 or, as in Figure 7.2, because both L and Y 
are caused by atherosclerosis U, an unmeasured variable. This bias is 
known as confounding by indication or channeling, the last term often 
being reserved to describe the bias created by patient-specific risk factors 
L that encourage doctors to use certain drug A within a class of drugs. 


• Lifestyle: The effect of behavior A (say, exercise) on the risk of Y (say, 
death) will be confounded if the behavior is associated with another be- 
havior L (say, cigarette smoking) that has a causal effect on Y and tends 
to co-occur with A. The structure of the variables L, A, and Y is depicted 
in the causal diagram in Figure 7.3, in which the unmeasured variable U 
represents the sort of personality and social factors that lead to both lack 
of exercise and smoking. Another frequent problem: subclinical disease 
U results both in lack of exercise A and an increased risk of clinical dis- 
ease Y. This form of confounding is often referred to as reverse causation 
when L is unknown. 


• Genetic factors: The effect of a DNA sequence A on the risk of developing 
certain trait Y will be confounded if there exists a DNA sequence L that 
has a causal effect on Y and is more frequent among people carrying A. 
This bias, also represented by the causal diagram in Figure 7.3, is known 
as linkage disequilibrium or population stratification, the last term often 
being reserved to describe the bias arising from conducting studies in a 
mixture of individuals from different ethnic groups. Thus the variable 
U can stand for ethnicity or other factors that result in linkage of DNA 
sequences. 


• Social factors: The effect of income at age 65 A on the level of disability 
at age 75 Y will be confounded if the level of disability at age 55 L affects 
both future income and disability level. This bias may be depicted by 
the causal diagram in Figure 7.1. 


• Environmental exposures: The effect of airborne particulate matter A on 
the risk of coronary heart disease Y will be confounded if other pollutants 
L whose levels co-vary with those of A cause coronary heart disease. This 
bias is also represented by the causal diagram in Figure 7.3, in which the 
unmeasured variable U represents weather conditions that affect the levels 
of all types of air pollution. 


In all these cases, the bias has the same structure: it is due to the presence 
of a cause (L or U) that is shared by the treatment A and the outcome Y, 
which results in an open backdoor path between A and Y. We refer to the 
bias caused by shared causes of treatment and outcome as confounding, and 
we use other names to refer to biases caused by structural reasons other than 
the presence of shared causes of treatment and outcome. For simplicity of 
presentation, we assume throughout this chapter that all nodes in the causal 
DAGs are perfectly measured, that there are no selection nodes S with a box 
around them (that is, the data are a random sample from the population of 
interest), and that random variability is absent. Causal DAGs with selection 
nodes will be discussed in Chapter 8, and causal DAGs with mismeasured 
nodes in Chapter 9. Random variability is discussed in Chapter 10. 


7.2 Confounding and exchangeability 


See Greenland and Robins (1986, 
2009) for a detailed discussion on 
the relations between confounding 
and exchangeability. 


Under conditional exchangeability, 
E[Y^{a=1}] − E[Y^{a=0}] = 
Σ_l E[Y|L = l, A = 1] Pr[L = l] − 
Σ_l E[Y|L = l, A = 0] Pr[L = l]. 


Pearl (1995, 2000) proposed the 
backdoor criterion for nonparamet- 
ric identification of causal effects. 


We now link the concept of confounding, which we have defined using causal 
diagrams, with the concept of exchangeability, which we have defined using 
counterfactuals in earlier chapters. For simplicity of presentation throughout 
this chapter, suppose that positivity and consistency hold, and that all causal 
DAGs include perfectly measured nodes that are not conditioned on. 

When exchangeability Y^a ⊥⊥ A holds, as in a marginally randomized experiment 
in which all individuals have the same probability of receiving treatment, 
the average causal effect can be identified without adjustment for any variables. 
For a binary treatment A, the average causal effect E[Y^{a=1}] − E[Y^{a=0}] 
is calculated as the difference of conditional means E[Y|A = 1] − E[Y|A = 0]. 

When exchangeability $Y^a \perp\!\!\!\perp A$ does not hold but conditional 
exchangeability $Y^a \perp\!\!\!\perp A \mid L$ does, as in a conditionally randomized 
experiment in which the probability of receiving treatment varies across values 
of L, the average causal effect can also be identified. However, as we described 
in Chapter 2, identification of the causal effect $E[Y^{a=1}] - E[Y^{a=0}]$ in the 
population requires adjustment for the variables L via standardization or IP 
weighting. As we described in Chapter 4, conditional exchangeability also allows 
the identification of the conditional causal effects 
$E[Y^{a=1} \mid L = l] - E[Y^{a=0} \mid L = l]$ for any value l via stratification. 
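
To make the two adjustment methods concrete, the following sketch computes the crude, standardized, and IP weighted risks from a small table of counts. The counts are hypothetical (they are not the heart transplant data of earlier chapters), and the function names are illustrative only. Exact fractions are used so that the agreement between standardization and IP weighting can be checked without rounding error.

```python
from fractions import Fraction

# Hypothetical counts, not taken from the book: (L, A) -> (individuals, deaths)
counts = {
    (0, 0): (60, 12),
    (0, 1): (20, 2),
    (1, 0): (30, 18),
    (1, 1): (90, 36),
}
N = sum(n for n, _ in counts.values())


def pr_l(l):
    """Pr[L = l] in the whole population."""
    return Fraction(sum(counts[(l, a)][0] for a in (0, 1)), N)


def crude_risk(a):
    """Pr[Y = 1 | A = a], ignoring L."""
    n = sum(counts[(l, a)][0] for l in (0, 1))
    deaths = sum(counts[(l, a)][1] for l in (0, 1))
    return Fraction(deaths, n)


def standardized_risk(a):
    """sum_l Pr[Y = 1 | A = a, L = l] Pr[L = l]."""
    return sum(
        Fraction(counts[(l, a)][1], counts[(l, a)][0]) * pr_l(l) for l in (0, 1)
    )


def ip_weighted_risk(a):
    """Risk among A = a after weighting each individual by 1 / Pr[A = a | L]."""
    num = den = Fraction(0)
    for l in (0, 1):
        n, deaths = counts[(l, a)]
        n_stratum = counts[(l, 0)][0] + counts[(l, 1)][0]
        weight = Fraction(n_stratum, n)  # 1 / Pr[A = a | L = l]
        num += deaths * weight
        den += n * weight
    return num / den


for a in (0, 1):
    print(a, crude_risk(a), standardized_risk(a), ip_weighted_risk(a))
```

In this made-up table the crude risk difference is slightly positive while the standardized (and IP weighted) risk difference is negative, which illustrates why adjustment for L is required when only conditional exchangeability holds.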

In practice, if we believe confounding is likely, a key question arises: can 
we determine whether there exists a set of measured covariates L for which 
conditional exchangeability holds? Answering this question is difficult because 
thinking in terms of conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ is often not intuitive 
in complex causal systems. 

In this chapter, we will see that answering this question is possible if one 
knows the causal DAG that generated the data. To do so, suppose that we 
know the true causal DAG (for now, it doesn’t matter how we know it: perhaps 
we have sufficient subject-matter knowledge, or perhaps an omniscient god gave 
it to us). How does the causal DAG allow us to determine whether there exists 
a set of variables L for which conditional exchangeability holds? There are 
two main approaches: (i) the backdoor criterion applied to the causal DAG 
and (ii) the transformation of the causal DAG into a SWIG. Though the use 
of SWIGs is a more direct approach, it also requires a bit more machinery so 
we are going to first explain the backdoor criterion; we will describe the SWIG 
approach in Section 7.5. 

A set of covariates L satisfies the backdoor criterion if all backdoor paths 
between A and Y are blocked by conditioning on L and L contains no variables 
that are descendants of treatment A. Under faithfulness and a further condition 
discussed in Technical Point 7.1, conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds if 
and only if L satisfies the backdoor criterion. (A simple proof of this fact will 
be given below based on SWIGs.) Hence, we can now answer any query we may 
have about whether, for a given set of covariates L, conditional exchangeability 
given L holds. Thus, by trying every subset of measured non-descendants of 
treatment, we can answer the question of whether conditional exchangeability 
holds for any subset. (In fact, algorithms exist that can greatly reduce the 
number of subsets that must be tried in order to answer the question.) 


Technical Point 7.1 


Does conditional exchangeability imply the backdoor criterion? That L satisfies the backdoor criterion always 
implies conditional exchangeability given L, even in the absence of faithfulness. In the main text we also said that, 
given faithfulness, conditional exchangeability given L implies that L satisfies the backdoor criterion. This last sentence 
is true under an FFRCISTG model (see Technical Point 6.2). In contrast, under an NPSEM-IE model, conditional 
exchangeability can hold even if the backdoor criterion does not, as is the case in a causal DAG with nodes A, L, Y and 
arrows $A \rightarrow L$, $A \rightarrow Y$. In this book we always assume an FFRCISTG model and faithfulness, unless stated otherwise. 

This difference between causal models is due to the fact that the NPSEM-IE, unlike an FFRCISTG model, assumes 
cross-world independencies between counterfactuals. However, a cross-world independence can never be verified, even in 
principle, by any randomized experiment, which was the very reason that Robins (1986, 1987) did not assume cross-world 
independence in his FFRCISTG model. For further discussion, see Chapter 22. 

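
The backdoor criterion stated in the main text can be checked mechanically on a small DAG. The sketch below is one possible brute-force implementation, not an algorithm from the book: it enumerates the backdoor paths, applies the d-separation blocking rules to each, and then tries every subset of the measured non-descendants of treatment. The edge lists for Figures 7.2 and 7.4 are assumptions reconstructed from the descriptions in this chapter, and all function names are hypothetical.

```python
from itertools import combinations


def descendants(edges, node):
    """Nodes reachable from `node` along directed edges (excluding `node`)."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(edges, start, end):
    """Yield every simple path from start to end, ignoring arrow direction."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def is_blocked(edges, path, z):
    """True if the path is blocked given the conditioning set z (a set)."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            # A conditioned-on non-collider blocks the path.
            return True
    return False


def satisfies_backdoor(edges, a, y, z):
    z = set(z)
    if z & descendants(edges, a):
        return False  # z must not contain descendants of treatment
    backdoor = [p for p in simple_paths(edges, a, y) if (p[1], a) in edges]
    return all(is_blocked(edges, p, z) for p in backdoor)


def sufficient_sets(edges, a, y, measured):
    """Every subset of measured non-descendants of a that satisfies the criterion."""
    candidates = sorted(set(measured) - descendants(edges, a) - {a, y})
    return [
        set(subset)
        for size in range(len(candidates) + 1)
        for subset in combinations(candidates, size)
        if satisfies_backdoor(edges, a, y, subset)
    ]


# Assumed edge sets (parent, child), reconstructed from the text.
FIG_7_2 = {("U", "L"), ("U", "Y"), ("L", "A"), ("A", "Y")}
FIG_7_4 = {("U2", "A"), ("U2", "L"), ("U1", "L"), ("U1", "Y"), ("A", "Y")}

print(sufficient_sets(FIG_7_2, "A", "Y", measured=["L"]))
print(sufficient_sets(FIG_7_4, "A", "Y", measured=["L"]))
```

Path enumeration is exponential in the worst case, which is acceptable only for diagrams of textbook size; as noted above, more efficient algorithms exist.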

Let us now relate the backdoor criterion (i.e., exchangeability) to confound- 
ing. The two settings in which the backdoor criterion is satisfied are 


1. No common causes of treatment and outcome. In Figure 6.2, there are no 
common causes of treatment and outcome, and hence no backdoor paths 
that need to be blocked. Then the set of variables that satisfies the back- 
door criterion is the empty set and we say that there is no confounding. 


2. Common causes of treatment and outcome but a subset L of measured 
non-descendants of A suffices to block all backdoor paths. In Figure 7.1, 
the set of variables that satisfies the backdoor criterion is L. Thus, we 
say that there is confounding, but that there is no residual confounding 
whose elimination would require adjustment for unmeasured variables 
(which, of course, is not possible). For brevity, we say that there is no 
unmeasured confounding. 


The first setting describes a marginally randomized experiment in which 
confounding is not expected because treatment assignment is solely deter- 
mined by the flip of a coin—or its computerized upgrade: the random number 
generator—and the flip of the coin cannot cause the outcome. That is, when the 
treatment is unconditionally randomly assigned, the treated and the untreated 
are expected to be exchangeable because no common causes exist or, equiva- 
lently, because there are no open backdoor paths. Marginal exchangeability, 
i.e., $Y^a \perp\!\!\!\perp A$, is equivalent to no common causes of treatment and outcome. 

The second setting describes a conditionally randomized experiment in 
which the probability of receiving treatment is the same for all individuals 
with the same value of L but, by design, this probability varies across values of 
L. This experimental design guarantees confounding if L is (i) a risk factor for 
the outcome Y and (ii) either a cause of the outcome (as in Figure 7.1) or the 
descendant of an unmeasured cause of the outcome as in Figure 7.2. Hence, 
there are open backdoor paths. However, conditioning on the covariates L 
will block all backdoor paths and therefore conditional exchangeability, i.e., 
$Y^a \perp\!\!\!\perp A \mid L$, will hold. We say that a set L of measured non-descendants of A 
is a sufficient set for confounding adjustment when conditioning on L blocks 
all backdoor paths—that is, the treated and the untreated are exchangeable 
within levels of L. 
Take our heart transplant study, a conditionally randomized experiment, 
as an example. Individuals who received a transplant (A = 1) are different 
from the others (A = 0) because, had the treated remained untreated, their 
risk of death Y would have been higher than that of those that were actually 
untreated—the treated had a higher frequency of severe heart disease L, a 
common cause of A and Y. The presence of common causes of treatment 
and outcome implies that the treated and the untreated are not marginally 
exchangeable but are conditionally exchangeable given L. This second setting 
is also what one hopes for in observational studies in which many variables L 
have been measured. 

The backdoor criterion does not answer questions regarding the magnitude 
or direction of confounding. It is logically possible that some unblocked back- 
door paths are weak (e.g., if L does not have a large effect on either A or Y) 
and thus induce little bias, or that several strong backdoor paths induce bias 
in opposite directions and thus result in a weak net bias. Because unmeasured 
confounding is not an “all or nothing” issue, in practice, it is important to 
consider the expected direction and magnitude of the bias (see Fine Point 7.1). 


7.3 Confounding and the backdoor criterion 


Figure 7.4 


We now describe several examples of the application of the backdoor criterion 
to determine whether the causal effect of A on Y is identifiable and, if so, which 
variables are required to ensure conditional exchangeability. Remember that 
all causal DAGs in this chapter include perfectly measured nodes that are not 
conditioned on. 

In Figure 7.1 there is confounding because the treatment A and the outcome 
Y share the cause L, i.e., because there is an open backdoor path between A 
and Y through L. However, this backdoor path can be blocked by conditioning 
on L. Thus, if the investigators collected data on L for all individuals, there is 
no unmeasured confounding given L. 

In Figure 7.2 there is confounding because the treatment A and the outcome 
Y share the unmeasured cause U, i.e., there is a backdoor path between A and 
Y through U. (Unlike the variables L, A, and Y, the variable U was not 
measured by the investigators.) This backdoor path could be theoretically 
blocked, and thus confounding eliminated, by conditioning on U, had data on 
this variable been collected. However, this backdoor path can also be blocked 
by conditioning on L. Thus, there is no unmeasured confounding given L. 

In Figure 7.3 there is also confounding because the treatment A and the 
outcome Y share the cause U, and the backdoor path can also be blocked by 
conditioning on L. Therefore there is no unmeasured confounding given L. 

Now consider Figure 7.4. In this causal diagram there are no common 
causes of treatment A and outcome Y, and therefore there is no confounding. 
The backdoor path between A and Y through L 
($A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y$) is blocked because L is a 
collider on that path. Thus all the association between A and Y is due to the 
effect of A on Y: association is causation. For example, suppose A represents 
physical activity, Y cervical cancer, $U_1$ a pre-cancer lesion, L a diagnostic 
test (Pap smear) for pre-cancer, and $U_2$ a health-conscious personality (more 
physically active, more visits to the doctor). Then, under the causal diagram 
in Figure 7.4, the effect of physical activity A on cancer Y is unconfounded 
and there is no need to adjust for L to compute either $\Pr[Y^{a=1} = 1]$ or 
$\Pr[Y^{a=0} = 1]$ and thus to compute the causal effect in the population. 
Fine Point 7.1 


The strength and direction of confounding bias. Suppose you conducted an observational study to identify the effect 
of heart transplant A on death Y and that you assumed no unmeasured confounding. A thoughtful critic says “the 
inferences from this observational study may be incorrect because of potential confounding due to cigarette smoking 
L.” A crucial question is whether the bias results in an attenuated or an exaggerated estimate of the effect of heart 
transplant. For example, suppose that the risk ratio from your study was 0.6 (heart transplant was estimated to reduce 
mortality during the follow-up by 40%) and that, as the reviewer suspected, cigarette smoking L is a common cause 
of A (cigarette smokers are less likely to receive a heart transplant) and Y (cigarette smokers are more likely to die). 
Because there are fewer cigarette smokers (L = 1) in the heart transplant group (A = 1) than in the other group 
(A = 0), one would have expected to find a lower mortality risk in the group A = 1 even under the null hypothesis of 
no effect of treatment A on Y. Adjustment for cigarette smoking will therefore move the effect estimate upwards (say, 
from 0.6 to 0.7). In other words, lack of adjustment for cigarette smoking resulted in an exaggeration of the beneficial 
average causal effect of heart transplant. 

An approach to predict the direction of confounding bias is the use of signed causal diagrams. Consider the causal 
diagram in Figure 7.1 with dichotomous L, A, and Y variables. A positive sign over the arrow from L to A is added if 
L has a positive average causal effect on A (i.e., if the probability of A = 1 is greater among those with L = 1 than 
among those with L = 0), otherwise a negative sign is added if L has a negative average causal effect on A (i.e., if the 
probability of A = 1 is greater among those with L = 0 than among those with L = 1). Similarly a positive or negative 
sign is added over the arrow from L to Y. If both arrows are positive or both arrows are negative, then the confounding 
bias is said to be positive, which implies that the effect estimate will be biased upwards in the absence of adjustment for 
L. If one arrow is positive and the other one is negative, then the confounding is said to be negative, which implies 
that the effect estimate will be biased downwards in the absence of adjustment for L. Unfortunately, this simple rule 
may fail in more complex causal diagrams or when the variables are nondichotomous. See VanderWeele, Hernán, and 
Robins (2008) for a more detailed discussion of signed diagrams in the context of average causal effects. 

Regardless of the sign of confounding, another key issue is the magnitude of the bias. Biases that are not large 
enough to affect the conclusions of the study may be safely ignored in practice, whether the bias is upwards or down- 
wards. A large confounding bias requires a strong confounder-treatment association and a strong confounder-outcome 
association (conditional on the treatment). For discrete confounders, the magnitude of the bias depends also on preva- 
lence of the confounder (Cornfield et al. 1959, Walker 1991). If the confounders are unknown, one can only guess what 
the magnitude of the bias is. Educated guesses can be organized by conducting sensitivity analyses (i.e., repeating the 
analyses under several assumptions regarding the magnitude of the bias), which may help quantify the maximum bias 
that is reasonably expected. See Greenland (1996a), Robins, Rotnitzky, and Scharfstein (1999), Greenland and Lash 
(2008), and VanderWeele and Arah (2011) for detailed descriptions of sensitivity analyses for unmeasured confounding. 
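
The sign rule for Figure 7.1 can be illustrated with an exact calculation. The sketch below assumes a hypothetical distribution (the probabilities are invented for illustration and differ from the 0.6 and 0.7 used in the text): smoking L raises the risk of death, and the risk ratio for treatment is 0.7 within both levels of L. Only the sign of the arrow from L to A is changed between the two scenarios.

```python
def risk_ratios(p_l1, p_a1_given_l, p_y1_given_al):
    """Return (crude RR, L-standardized RR) for binary L, A, Y under Figure 7.1.

    p_l1           -- Pr[L = 1]
    p_a1_given_l   -- dict l -> Pr[A = 1 | L = l]
    p_y1_given_al  -- dict (a, l) -> Pr[Y = 1 | A = a, L = l]
    """
    p_l = {1: p_l1, 0: 1 - p_l1}

    def p_a(a, l):
        return p_a1_given_l[l] if a == 1 else 1 - p_a1_given_l[l]

    def crude(a):
        # Pr[Y = 1 | A = a] = sum_l Pr[Y = 1 | a, l] Pr[L = l | A = a]
        p_a_marginal = sum(p_a(a, l) * p_l[l] for l in (0, 1))
        joint = sum(p_y1_given_al[(a, l)] * p_a(a, l) * p_l[l] for l in (0, 1))
        return joint / p_a_marginal

    def standardized(a):
        # sum_l Pr[Y = 1 | a, l] Pr[L = l]
        return sum(p_y1_given_al[(a, l)] * p_l[l] for l in (0, 1))

    return crude(1) / crude(0), standardized(1) / standardized(0)


# Hypothetical risks: positive arrow from L to Y, risk ratio 0.7 in each stratum.
RISKS = {(0, 0): 0.20, (1, 0): 0.14, (0, 1): 0.50, (1, 1): 0.35}

# Negative arrow from L to A: smokers are less likely to be treated.
negative = risk_ratios(0.5, {0: 0.6, 1: 0.2}, RISKS)
# Positive arrow from L to A: smokers are more likely to be treated.
positive = risk_ratios(0.5, {0: 0.2, 1: 0.6}, RISKS)

print("L->A negative: crude %.3f, adjusted %.3f" % negative)
print("L->A positive: crude %.3f, adjusted %.3f" % positive)
```

With one negative and one positive arrow the crude risk ratio falls below the adjusted one (bias downwards, exaggerating the benefit); with two positive arrows it lies above it (bias upwards), in line with the rule described above.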


An informal definition for Figures 
7.1 to 7.4: ‘A confounder is any 
variable that can be used to adjust 
for confounding.’ Note this definition 
is not circular because we have 
previously provided a definition of 
confounding. Another example of 
a non-circular definition: “A musician 
is a person who plays music,” 
stated after we have defined what 
music is. 


Suppose, as in the last four examples, that data on L, A, and Y suffice to 
identify the causal effect. In such a setting we define L to be a confounder if 
the data on A and Y do not suffice for identification (i.e., we have structural 
confounding). We define L to be a non-confounder if data on A, Y alone suffice 
for identification. These definitions are equivalent to defining L as a confounder 
if there is conditional exchangeability but not unconditional exchangeability 
(i.e., structural confounding) and as a non-confounder if there is unconditional 
exchangeability. 

Thus, in Figures 7.1-7.3, L is a confounder because $\Pr[Y^a = 1]$ is identified 
by the standardized risk $\sum_l \Pr[Y = 1 \mid A = a, L = l] \Pr[L = l]$. In Figures 7.2 
and 7.3, L is not a common cause of A and Y, yet we still say that L is a 
confounder because it is needed to block the open backdoor path attributable 
to the unmeasured common cause U of A and Y. In Figure 7.4, L is a 
non-confounder and the identifying formula for $\Pr[Y^a = 1]$ is just the conditional 
mean $\Pr[Y = 1 \mid A = a]$. 
The possibility of identification 
of unconditional effects without 
identification of conditional effects 
was non-graphically demonstrated 
by Greenland and Robins (1986). 
The conditional bias in Figure 7.4 
was described by Greenland et 
al. (1999) and referred to as M-bias 
(Greenland 2003) because the 
structure of the variables involved 
in it ($U_2$, $L$, $U_1$) resembles a letter 
M lying on its side. 


If $U_1$ caused $U_2$, or $U_2$ caused $U_1$, 
or an unmeasured $U_3$ caused both, 
there would exist a common cause 
of A and Y, and we would have neither 
unconditional nor conditional 
exchangeability given L. 


The definition of collider is path-specific: 
L is a collider on the path 
$A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y$, but not 
on the path $A \leftarrow L \leftarrow U_1 \rightarrow Y$. 





Figure 7.5 


Figure 7.6 


Interestingly, in Figure 7.4, conditional exchangeability given L does not 
hold and thus the counterfactual risks $\Pr[Y^a = 1 \mid L = l]$ are not equal to 
the stratum-specific risks $\Pr[Y = 1 \mid A = a, L = l]$, and the conditional 
treatment effects within strata of L are not identified. Further, adjustment for L 
via standardization $\sum_l \Pr[Y = 1 \mid A = a, L = l] \Pr[L = l]$ gives a biased 
estimate of $\Pr[Y^a = 1]$. This follows from the fact that adjustment for L would 
induce bias because conditioning on the collider L opens the backdoor path 
between A and Y ($A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y$), which was 
previously blocked by the collider itself. Thus the association between A and Y 
would be a mixture of the association due to the effect of A on Y and the 
association due to the open backdoor path. Association would not be causation 
any more. This is the first example we have seen for which unconditional 
exchangeability holds but conditional exchangeability does not: the average 
causal effect is identified, but generally not the conditional causal effects 
within levels of L. We refer to the resulting bias in the conditional effect as 
selection bias because it arises from selecting (conditioning) on the common 
effect L of two marginally independent variables $U_1$ and $U_2$, one of which is 
associated with A and the other with Y (see Chapter 8). 
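
The bias induced by adjusting for the collider L in Figure 7.4 can be verified with an exact calculation over a small discrete distribution. The sketch below assumes hypothetical conditional probabilities for binary $U_1$, $U_2$, L, A, and Y that respect the arrows described for Figure 7.4; the numbers are illustrative only, and the true causal risk difference is 0.2 by construction.

```python
from itertools import product

# Hypothetical parameters for Figure 7.4 (all variables binary).
P_U1, P_U2 = 0.5, 0.5


def p_l1(u1, u2):
    """Pr[L = 1 | U1, U2]: L is a common effect of U1 and U2."""
    return 0.1 + 0.4 * u1 + 0.4 * u2


def p_a1(u2):
    """Pr[A = 1 | U2]."""
    return 0.2 + 0.6 * u2


def p_y1(a, u1):
    """Pr[Y = 1 | A, U1]: treatment raises the risk by 0.2."""
    return 0.1 + 0.2 * a + 0.5 * u1


def bern(p, value):
    return p if value == 1 else 1 - p


# Joint distribution of the measured variables (L, A, Y), summing out U1, U2.
joint = {}
for u1, u2, l, a, y in product((0, 1), repeat=5):
    p = (bern(P_U1, u1) * bern(P_U2, u2) * bern(p_l1(u1, u2), l)
         * bern(p_a1(u2), a) * bern(p_y1(a, u1), y))
    joint[(l, a, y)] = joint.get((l, a, y), 0.0) + p


def pr(event):
    return sum(p for (l, a, y), p in joint.items() if event(l, a, y))


def risk(a_val, l_val=None):
    """Pr[Y = 1 | A = a_val], optionally also conditioning on L = l_val."""
    def stratum(l, a, y):
        return a == a_val and (l_val is None or l == l_val)
    return pr(lambda l, a, y: y == 1 and stratum(l, a, y)) / pr(stratum)


true_rd = sum(bern(P_U1, u1) * (p_y1(1, u1) - p_y1(0, u1)) for u1 in (0, 1))
crude_rd = risk(1) - risk(0)
adjusted_rd = sum(
    pr(lambda l, a, y, lv=lv: l == lv) * (risk(1, lv) - risk(0, lv))
    for lv in (0, 1)
)

print(f"true {true_rd:.3f}  crude {crude_rd:.3f}  L-adjusted {adjusted_rd:.3f}")
```

Under these assumed numbers the unadjusted risk difference equals the causal risk difference, whereas standardizing by L gives roughly 0.15 instead of 0.2: adjustment introduces the bias rather than removing it.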

The causal diagram in Figure 7.5 is a variation of the one in Figure 7.4. 
The difference is that, in Figure 7.5, there is an arrow $L \rightarrow A$. The presence 
of this arrow creates an open backdoor path $A \leftarrow L \leftarrow U_1 \rightarrow Y$ 
because $U_1$ is a common cause of A and Y, and so confounding exists. 
Conditioning on L would block that backdoor path but would simultaneously open 
a backdoor path on which L is a collider 
($A \leftarrow U_2 \rightarrow L \leftarrow U_1 \rightarrow Y$). 

Therefore, in Figure 7.5, the bias is intractable: attempting to block the 
confounding path opens a selection bias path. There is neither unconditional 
exchangeability nor conditional exchangeability given L. A solution to the bias 
in Figure 7.5 would be to measure either (i) a variable $L_1$ between $U_1$ and either 
A or Y, or (ii) a variable $L_2$ between $U_2$ and either A or L. In the first case we 
would have conditional exchangeability given $L_1$. In the second case we would 
have conditional exchangeability given both $L_2$ and L. For example, Figure 
7.6 includes the variable $L_1$ between $U_1$ and Y and the variable $L_2$ between 
$U_2$ and A. See Fine Point 7.2 for a discussion of identification of causal effects 
depending on what variables are measured in Figure 7.6. 
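
The claims about Figure 7.6 can be checked with a second, more scalable test of d-separation: delete the arrows leaving A, keep only the ancestors of A, Y, and the adjustment set, connect ("marry") parents that share a child, drop arrow directions, remove the adjustment set, and see whether A and Y are still connected. This moralization procedure is a standard graphical technique rather than something described in this chapter, and the edge list below is an assumption reconstructed from the description of Figure 7.6 in the text.

```python
from itertools import combinations


def reachable(adjacency, start):
    """All nodes reachable from the nodes in `start` (including them)."""
    seen, stack = set(), list(start)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency.get(node, ()))
    return seen


def satisfies_backdoor(edges, a, y, z):
    z = set(z)
    children, parents = {}, {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
        parents.setdefault(child, set()).add(parent)
    if z & (reachable(children, [a]) - {a}):
        return False  # the set contains a descendant of treatment

    # Backdoor graph: remove every arrow that leaves A.
    bd_parents = {c: {p for p in ps if p != a} for c, ps in parents.items()}
    # Ancestral set of A, Y, and z in the backdoor graph.
    keep = reachable(bd_parents, {a, y} | z)

    # Moral graph: undirected parent-child links plus links between co-parents.
    neighbours = {node: set() for node in keep}
    for child in keep:
        ps = bd_parents.get(child, set())
        for p in ps:
            neighbours[p].add(child)
            neighbours[child].add(p)
        for p, q in combinations(ps, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)

    # Remove z; A and Y are d-separated given z iff they are now disconnected.
    pruned = {n: ns - z for n, ns in neighbours.items() if n not in z}
    return y not in reachable(pruned, [a])


# Assumed edges (parent, child) for Figure 7.6, reconstructed from the text.
FIG_7_6 = {
    ("U1", "L"), ("U1", "L1"), ("L1", "Y"),
    ("U2", "L"), ("U2", "L2"), ("L2", "A"),
    ("L", "A"), ("A", "Y"),
}

for candidate in [set(), {"L"}, {"L2"}, {"L1"}, {"L2", "L"}, {"L1", "L"}]:
    print(sorted(candidate), satisfies_backdoor(FIG_7_6, "A", "Y", candidate))
```

Under the assumed diagram, the empty set, L alone, and $L_2$ alone fail the criterion, while $L_1$ alone, the pair $L_2$ and L, and the pair $L_1$ and L satisfy it, which agrees with the statements in the main text and in Fine Point 7.2.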

The causal diagrams in this section depict two structural sources of lack of 
exchangeability that are due to the presence of open backdoor paths between 
treatment and outcome. The first source is the presence of common causes 
of treatment and outcome—which creates an open backdoor path. The sec- 
ond source is conditioning on a common effect—which may open a previously 
blocked backdoor path. For pedagogic purposes, we have reserved the term 
“confounding” for the first and “selection bias” for the latter. An alterna- 
tive way to structurally define confounding could be the “bias due to an open 
backdoor path between A and Y.” This alternative definition is identical to 
ours except that it labels the bias due to conditioning on L in Figure 7.4 as 
confounding rather than as selection bias. The alternative definition can be 
equivalently expressed as follows: confounding is “any systematic bias that 
would be eliminated by randomized assignment of A”. To see this, note that 
the bias induced in Figure 7.4 by conditioning on L could not occur in an 
experiment in which treatment A is randomly assigned because the random 
assignment ensures the absence of an unmeasured $U_2$ that is a common cause 
of A and L and thus conditioning on L would no longer open a backdoor path. 

Fine Point 7.2 


Identification of conditional and unconditional effects. Under any causal diagram, the causal effects that can be 
identified depend on the variables that are measured in addition to the treatment and the outcome. Take Figure 7.6 as 
an example. If we measure only $L_2$ (but not L and $L_1$), we have neither unconditional nor conditional exchangeability 
given $L_2$, and no causal effects can be identified. If we measure $L_2$ and L, we have conditional exchangeability given 
$L_2$ and L, but we do not have conditional exchangeability given either $L_2$ alone or L alone. However, we can identify: 


• The conditional causal effects within joint strata of $L_2$ and L. The identifying formula for each of the counterfactual 
means is $E[Y \mid A = a, L = l, L_2 = l_2]$. 


• The unconditional causal effect. The identifying formula for each of the counterfactual means is 
$\sum_{l, l_2} E[Y \mid A = a, L = l, L_2 = l_2] \Pr[L = l, L_2 = l_2]$. 


• The conditional causal effects within strata of L. The identifying formula for each of the counterfactual means is 
$\sum_{l_2} E[Y \mid A = a, L = l, L_2 = l_2] \Pr[L_2 = l_2 \mid L = l]$. 


• The conditional causal effects within strata of $L_2$. The identifying formula for each of the counterfactual means 
is $\sum_{l} E[Y \mid A = a, L = l, L_2 = l_2] \Pr[L = l \mid L_2 = l_2]$. 


If we only measure $L_1$, then we have conditional exchangeability given $L_1$ so we can identify the conditional causal 
effects within strata of $L_1$ and the unconditional causal effect. If we measure $L_1$ and L, then we can also identify the 
conditional causal effects within joint strata of $L_1$ and L, and within strata of L alone. If we measure L, $L_1$, and $L_2$, 
then we can also identify the conditional effects within joint strata of all three variables. 


One interesting distinction between these two definitions is the following. 
The existence of a common cause of treatment and the outcome (the structural 
definition of confounding) is a substantive fact about the study population 
and the world, independent of the method chosen to analyze the data. On 
the other hand, the definition of confounding as any bias that would have been 
eliminated by randomization implies that the existence of confounding depends 
on the method of analysis. In Figure 7.4, we have no confounding if we do not 
adjust for L, but we introduce confounding if we do adjust. 

Nonetheless, the choice of one definition over the other is just a matter of 
taste with no practical implications as all our conclusions regarding identifiabil- 
ity are based solely on whether conditional and/or unconditional exchangeabil- 
ity holds and not on our definition of confounding. The next chapter provides 
more detail on the distinction between structural confounding and selection 
bias. 


7.4 Confounding and confounders 


In the previous section, we have described how to use causal diagrams to 
decide whether confounding exists and, if so, to identify whether a given set 
of measured variables L is a sufficient set for confounding adjustment. The 
procedure requires a priori knowledge of the causal DAG that includes all 
causes—both measured and unmeasured—shared by the treatment A and the 
outcome Y. Once the causal diagram is known, we simply need to apply the 
backdoor criterion to determine what variables need to be adjusted for. 

In contrast, the traditional approach to handle confounding was based 
mostly on observed associations rather than on prior causal knowledge. The 
traditional approach first labels variables that meet certain (mostly) 
associational conditions as confounders and then mandates that these so-called 
confounders are adjusted for in the analysis. Confounding is said to exist when 
the adjusted estimate differs from the unadjusted estimate. 


Technically, investigators do not 
need structural knowledge. They 
only need to know a set of variables 
that guarantees conditional 
exchangeability. However, acquiring 
the structural knowledge, and 
therefore drawing the causal 
diagram, is arguably the most natural 
approach to reason about conditional 
exchangeability. 


Figure 7.7 


Figure 7.8 


Under the traditional approach, a confounder was defined as a variable that 
meets the following three conditions: (1) it is associated with the treatment, 
(2) it is associated with the outcome conditional on the treatment (with “con- 
ditional on the treatment” often replaced by “in the untreated”), and (3) it 
does not lie on a causal pathway between treatment and outcome. However, 
this traditional approach may lead to inappropriate adjustment. To see why, 
let us revisit Figures 7.1-7.4. 

In Figure 7.1, the variable L is associated with the treatment (because it 
has a causal effect on A), is associated with the outcome conditional on the 
treatment (because it has a direct causal effect on Y), and it does not lie 
on the causal pathway between treatment and outcome. In Figure 7.2, the 
variable L is associated with the treatment (because it has a causal effect on 
A), is associated with the outcome conditional on the treatment (because it 
shares the cause U with Y), and it does not lie on the causal pathway between 
treatment and outcome. In Figure 7.3, L is associated with the treatment (it 
shares the cause U with A), is associated with the outcome conditional on 
the treatment (it has a causal effect on Y), and it does not lie on the causal 
pathway between treatment and outcome. 

Therefore, according to the traditional approach, L is a confounder in the 
settings represented by Figures 7.1-7.3 and it needs to be adjusted for. That was 
also our conclusion when using the backdoor criterion in the previous section. 
For Figures 7.1-7.3, there is no discrepancy between the traditional, mostly 
associational approach and the application of the backdoor criterion to the 
causal diagram. 

Now consider Figure 7.4 again in which there is no confounding and L is a 
non-confounder by the definition given in Section 7.3. However, L meets the 
criteria for a traditional confounder: it is associated with the treatment (it 
shares the cause $U_2$ with A), it is associated with the outcome conditional on 
the treatment (it shares the cause $U_1$ with Y), and it does not lie on the causal 
pathway between treatment and outcome. Hence, according to the traditional 
approach, L is a confounder that should be adjusted for, even in the absence 
of confounding! But, as we saw above, adjustment for L results in a biased 
estimator of the causal effect in the population due to selection bias. Figure 
7.7 is another example in which the traditional approach leads to inappropriate 
adjustment for L by inducing selection bias. 

These examples show that associational or statistical criteria are insufficient 
to characterize confounding. An approach based on a definition of confounder 
that relies almost exclusively on statistical considerations may lead, as shown 
by Figures 7.4 and 7.7, to the wrong advice: adjust for a “confounder” even 
when structural confounding does not exist. To eliminate this problem for 
Figure 7.4, a follower of the traditional approach might replace the associational 
condition “(2) it is associated with the outcome conditional on the treatment” 
by the structural condition “(2) it is a cause of the outcome.” This modified 
definition of confounder prevents inappropriate adjustment for L in Figure 7.4, 
but only to create a new problem by not considering L a confounder—that 
needs to be adjusted for—in Figure 7.2. See Technical Point 7.2. 

Fine Point 7.3 


Surrogate confounders. Under the causal DAG in Figure 7.8, there is confounding for the effect of A on Y because 
of the presence of the unmeasured common cause U. The measured variable L is a proxy or surrogate for U. For 
example, the unmeasured variable socioeconomic status U may confound the effect of physical activity A on the risk 
of cardiovascular disease Y. Income L is a surrogate for the often ill-defined variable socioeconomic status. Should 
we adjust for the variable L? On the one hand, it can be said that L is not a confounder because it does not lie on 
a backdoor path between A and Y. On the other hand, adjusting for the measured L, which is associated with the 
unmeasured U, may indirectly adjust for some of the confounding caused by U. In the extreme, if L were perfectly 
correlated with U then it would make no difference whether one conditions on L or on U. Indeed if L is binary and is 
a nondifferentially misclassified (see Chapter 9) version of U, conditioning on L will result in a partial blockage of the 
backdoor path $A \leftarrow U \rightarrow Y$ under some weak conditions (Greenland 1980, Ogburn and VanderWeele 2012). Therefore 
we will typically prefer to adjust, rather than not to adjust, for L. 

We refer to variables that can be used to reduce confounding bias even though they are not on a backdoor path (and 
so could never completely eliminate confounding) as surrogate confounders. A possible strategy to fight confounding is 
to measure as many surrogate confounders as possible and adjust for all of them. See Chapter 18 for discussion. 


The traditional approach misleads investigators into adjusting for variables 
when adjustment is harmful. The problem arises because the traditional 
approach starts by defining confounders in the absence of sufficient causal 
knowledge about the sources of confounding, and then mandates adjustment for 
those so-called confounders. If the adjusted and unadjusted estimates differ, 
the traditional approach declares the existence of confounding. However, 
change in estimates may occur for reasons other than confounding, including 
selection bias when adjusting for non-confounders (see Chapter 8) and the use 
of noncollapsible effect measures (see Fine Point 4.3). Attempts to define 
confounding based on change in estimates have been long abandoned because of 
these problems. 
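
A change in estimate without any confounding is easy to reproduce with a noncollapsible measure such as the odds ratio. The sketch below assumes a hypothetical randomized experiment in which treatment A is independent of a binary risk factor L; the risks are invented so that the odds ratio equals 4 within each level of L.

```python
# Hypothetical setting: A is randomized, so L is independent of A and there
# is no confounding by L. All numbers are illustrative.
P_L1 = 0.5  # Pr[L = 1]

# (a, l) -> Pr[Y = 1 | A = a, L = l]
RISK = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.8}


def odds(p):
    return p / (1 - p)


def conditional_or(l):
    """Odds ratio for A within the stratum L = l."""
    return odds(RISK[(1, l)]) / odds(RISK[(0, l)])


def marginal_risk(a):
    """Pr[Y = 1 | A = a]; valid as a simple average because A is independent of L."""
    return RISK[(a, 0)] * (1 - P_L1) + RISK[(a, 1)] * P_L1


marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))
marginal_rd = marginal_risk(1) - marginal_risk(0)

print("conditional OR:", conditional_or(0), conditional_or(1))
print("marginal OR:   ", marginal_or)
print("marginal RD:   ", marginal_rd)
```

Here the stratum-specific odds ratios are both 4 while the unadjusted odds ratio is about 3.45, even though the unadjusted and stratum-specific risk differences are all 0.3. The adjusted and unadjusted odds ratios differ, yet no confounding is present.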

In contrast, a structural approach starts by explicitly identifying the sources 
of confounding—the common causes of treatment and outcome that, were they 
all measured, would be sufficient to adjust for confounding—and then identifies 
a sufficient set of adjustment variables. 

The structural approach makes clear that including a particular variable 
in a sufficient set depends on the variables already included in the set. For 
example, in Figures 7.2 and 7.3 the set of variables L is needed to block a 
backdoor path because the set of variables U is not measured. We could then 
say that the variables in L are confounders. However, if the variables U had 
been measured and used to block the backdoor path, then the variables L 
would not be confounders given U (see also Fine Point 7.3). Given a causal 
DAG, confounding is an absolute concept whereas confounder is a relative one. 

VanderWeele and Shpitser (2013) 
also proposed a formal definition of 
confounder. 

A structural approach to confounding emphasizes that causal inference from 
observational data requires a priori causal knowledge. This causal knowledge 
is summarized in a causal DAG that encodes the researchers’ beliefs or as- 
sumptions about the causal network. Of course, there is no guarantee that the 
researchers’ causal DAG is correct and thus it is possible that, contrary to the 
researchers’ beliefs, their chosen set of adjustment variables fails to eliminate 
confounding or introduces selection bias. However, the structural approach 
to confounding has two important advantages. First, it prevents inconsisten- 
cies between beliefs and actions. For example, if you believe Figure 7.4 is the 
true causal diagram—and therefore that there is no confounding for the effect 
of A on Y—then you will not adjust for the variable L, regardless of what 
non-structural definitions of confounder may say. Second, the researchers’ as- 
sumptions about confounding become explicit and therefore can be explicitly 
criticized by other investigators. 







Technical Point 7.2 


Fixing the traditional definition of confounder. Figures 7.4 and 7.7 depict two graphical examples in which the 
traditional non-graphical definition of confounder and confounding misleads investigators into adjusting for a variable 
when adjustment for such variable is not only superfluous but also harmful. The traditional definition fails because it 
relies on two incorrect statistical criteria—conditions (1) and (2)—and one incorrect causal criterion—condition (3). To 
“fix” the traditional definition one needs to do two things: 


1. Replace condition (3) by the condition that “there exist variables L and U such that there is conditional 
exchangeability within their joint levels $Y^a \perp\!\!\!\perp A \mid L, U$.” This new condition is stronger than the earlier 
condition because it effectively implies that L is not on a causal pathway between A and Y and that 
$E[Y^a \mid L = l, U = u]$ is identified by $E[Y \mid L = l, U = u, A = a]$. 


2. Replace conditions (1) and (2) by the following condition: U can be decomposed into two disjoint subsets $U_1$ and 
$U_2$ (i.e., $U = U_1 \cup U_2$ and $U_1 \cap U_2$ is empty) such that (i) $U_1$ and A are not associated within strata of L, and 
(ii) $U_2$ and Y are not associated within joint strata of A, L, and $U_1$. The variables in $U_1$ may be associated with 
the variables in $U_2$. $U_1$ can always be chosen to be the largest subset of U that is unassociated with treatment. 


If these two new conditions are met we say U is a non-confounder given data on L. These conditions were 
proposed by Robins (1997, Theorem 4.3) and further discussed by Greenland, Pearl, and Robins (1999, pp. 45-46, note 
the condition that $U = U_1 \cup U_2$ was inadvertently left out). These conditions overcome the difficulties found in Figures 
7.4 and 7.7 because they allow us to dismiss variables as non-confounders (Robins 1997). For example, Greenland, 
Pearl, and Robins applied these conditions to Figure 7.4 to show that there is no confounding. 


7.5 Single-world intervention graphs 


Exchangeability is translated into graph language as the lack of open paths 
between the treatment A and outcome Y nodes (other than those originating 
from A) that would result in an association between A and Y. Chapters 7-9 
describe different ways in which lack of exchangeability can be represented 
in causal diagrams. For example, in this chapter we discuss confounding, a 
violation of exchangeability due to the presence of an open backdoor path 
between treatment and outcome. 

The equivalence between unconditional exchangeability $Y^a \perp\!\!\!\perp A$ and the 
backdoor criterion seems rather magical: there appears to be no obvious re- 
lationship between counterfactual independence and the absence of backdoor 
paths because counterfactuals are not included as variables on causal diagrams. 
Since graphs are so useful for evaluating independencies via d-separation, it 
seems natural to want to construct graphs that include counterfactuals as 
nodes, so that unconditional and conditional exchangeability can be directly 
read off the graph. 

A new type of graph, the single-world intervention graph (SWIG), unifies 
the counterfactual and graphical approaches by explicitly including the 
counterfactual variables on the graph. A SWIG depicts the variables and causal 
relations that would be observed in a hypothetical world in which all individuals 
received treatment level a. That is, a SWIG is a graph that represents 
a counterfactual world created by a single intervention. In contrast, the variables 
on a standard causal diagram represent the actual world. A SWIG can 
then be viewed as a function that transforms a given causal diagram under a 
given intervention. The following examples describe this transformation. 

Robins and Richardson (2013) showed that SWIGs overcome some of the shortcomings of previously 
proposed twin causal diagrams (Balke and Pearl 1994). 

Under an FFRCISTG model, it can be shown that d-separation also implies statistical independence on 
the SWIG. 

Suppose the causal diagram in Figure 7.2 represents the observed study 
data. The SWIG in Figure 7.9 is a transformation of Figure 7.2 that represents 
a world in which all individuals have received an intervention that sets their 
treatment to the fixed value a. 

In the SWIG, the treatment node is split into left and right sides which are 
to be regarded as separate nodes (variables) once split. The right side encodes 
the treatment value a under the intervention and inherits all the arrows that 
were out of A in the original causal DAG. The left side encodes the value of 
treatment A that would have been observed in the absence of intervention, 
i.e., the natural value of treatment. It inherits all arrows that were into A on 
the causal DAG because its causal inputs are the same in the intervened-on 
(counterfactual) world as in the actual world. Note that A does not have 
an arrow into a because the value a is the same for all individuals, i.e., is a 
constant in the intervened-on world. 
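
The node-splitting rule just described can be sketched as a small function. This is only an illustration under stated assumptions: the edge list `example_dag` is a hypothetical diagram with a measured common cause L (the exact arrows of Figure 7.2 are not reproduced in the text), and the names `to_swig` and `descendants` are made up for this sketch.

```python
from collections import defaultdict

def descendants(edges, node):
    """Return every node reachable from `node` along directed edges."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def to_swig(edges, treatment="A", value="a"):
    """Split `treatment` and relabel its descendants as counterfactuals."""
    affected = descendants(edges, treatment)

    def label(node):
        # Only descendants of treatment can change under the intervention.
        return f"{node}^{value}" if node in affected else node

    swig = []
    for parent, child in edges:
        if parent == treatment:
            # Arrows out of A now leave the fixed value a (right side).
            swig.append((value, label(child)))
        else:
            # Arrows into A and all other arrows keep their source; the
            # left side A keeps the natural value of treatment.
            swig.append((label(parent), label(child)))
    return swig

# Hypothetical DAG (an assumption): L -> A, L -> Y, A -> Y.
example_dag = [("L", "A"), ("L", "Y"), ("A", "Y")]
swig = to_swig(example_dag)
print(swig)  # [('L', 'A'), ('L', 'Y^a'), ('a', 'Y^a')]
```

Note that no edge connects A to a in the output, which mirrors the statement that A has no arrow into the constant a.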

We assume that the natural value of treatment A is well defined even though 
we are generally unable to measure it under intervention a. In some settings, 
though, A may be measurable: recent experiments suggest that electroencephalogram 
recordings can detect the choice individuals will make up to 1/2 
second before individuals become conscious of their decision. If so, A could 
actually be measured via electroencephalogram, while still leaving 1/2 second 
to intervene and give treatment a. 

In the SWIG, the outcome is $Y^a$, the value of Y in the intervened-on world. 
Because the remaining variables are temporally prior to A, they are not affected 
by the intervention and therefore take the same value as in the observed world, 
i.e., they are not labelled as counterfactual variables. In fact, any variable 
that is a non-descendant of A need not be labelled as a counterfactual because, 
under the faithfulness assumption (which we make), treatment has no causal 
effect on its non-descendants for any individual. Under our causal model, 
conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ holds because all paths between $Y^a$ and 
A are blocked after conditioning on L, i.e., $Y^a$ and A are d-separated given L. 

Consider now the causal diagram in Figure 7.4 and the SWIG in Figure 
7.10. Marginal exchangeability $Y^a \perp\!\!\!\perp A$ holds because, on the SWIG, all paths 
between $Y^a$ and A are blocked (without conditioning on L). In contrast, 
conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ does not hold because, on the SWIG, the 
path $Y^a \leftarrow U_1 \rightarrow L \leftarrow U_2 \rightarrow A$ is open when the collider L is conditioned 
on. This is why the marginal A-Y association is causal, but the conditional A-Y 
association given L is not, and thus any method that adjusts for L results in 
bias. These examples show how SWIGs unify the counterfactual and graphical 
approaches. In fact it is straightforward to see that, on the SWIG, $Y^a$ is 
d-separated from A given L if and only if L is a non-descendant of A that blocks 
all backdoor paths from A to Y (see also Fine Point 7.4). 
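
The two exchangeability statements above can be checked mechanically with a d-separation test. The sketch below uses the standard moralized ancestral graph criterion. The edge list `swig_m_bias` is an assumption reconstructed from the single path named in the text, and the function names are hypothetical.

```python
from itertools import combinations

def ancestors_of(edges, nodes):
    """Return `nodes` together with all of their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        current = frontier.pop()
        for parent, child in edges:
            if child == current and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result

def d_separated(edges, x, y, given=()):
    """Check whether x and y are d-separated given `given` in a DAG."""
    keep = ancestors_of(edges, {x, y, *given})
    kept_edges = [(p, c) for p, c in edges if p in keep and c in keep]
    # Moralize: link parents that share a child, then drop directions.
    undirected = {frozenset(e) for e in kept_edges}
    for node in keep:
        parents = [p for p, c in kept_edges if c == node]
        undirected.update(frozenset(pair) for pair in combinations(parents, 2))
    # Remove the conditioning set and search for a remaining connection.
    blocked = set(given)
    seen, stack = {x}, [x]
    while stack:
        current = stack.pop()
        for edge in undirected:
            if current in edge:
                (other,) = edge - {current}
                if other not in blocked and other not in seen:
                    seen.add(other)
                    stack.append(other)
    return y not in seen

# Assumed SWIG edges, taken from the path Y^a <- U1 -> L <- U2 -> A.
swig_m_bias = [("U1", "Y^a"), ("U1", "L"), ("U2", "L"), ("U2", "A")]
marginal = d_separated(swig_m_bias, "Y^a", "A")            # True
conditional = d_separated(swig_m_bias, "Y^a", "A", ["L"])  # False
print(marginal, conditional)
```

Conditioning on the collider L marries its parents U1 and U2 in the moral graph, which is exactly what opens the path.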
Fine Point 7.4 


Confounders cannot be descendants of treatment, but can be in the future of treatment. Consider the causal 
DAG in Figure 7.11. L is a descendant of treatment A that blocks all backdoor paths from A to Y. Unlike in Figures 
7.4 and 7.7, conditioning on L does not cause selection bias because no collider path is opened. Rather, because the 
causal effect of A on Y is solely through the intermediate variable L, conditioning on L completely blocks this pathway. 
This example shows that adjusting for a variable L that blocks all backdoor paths does not eliminate bias when L is a 
descendant of A. 

Since conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$ implies that the adjustment for L eliminates all bias, it must be the case 
that conditional exchangeability fails to hold and the average treatment effect $E[Y^{a=1}] - E[Y^{a=0}]$ cannot be identified 
in this example. This failure can be verified by analyzing the SWIG in Figure 7.12, which depicts a counterfactual world 
in which A has been set to the value a. In this world, the factual variable L is replaced by the counterfactual variable $L^a$, 
that is, the value of L that would have been observed if all individuals had received treatment value a. Since $L^a$ blocks 
all paths from $Y^a$ to A we conclude that $Y^a \perp\!\!\!\perp A \mid L^a$ holds, but we cannot conclude that conditional exchangeability 
$Y^a \perp\!\!\!\perp A \mid L$ holds as L is not even on the graph. (Under an FFRCISTG, any independence that cannot be read off the 
SWIG cannot be assumed to hold.) Therefore, we cannot ensure that the average treatment effect $E[Y^{a=1}] - E[Y^{a=0}]$ 
is identified from data on (L, A, Y). 

The problem arises because L is a descendant of A, not because L is in the future of A. If, in Figure 7.11, 
the arrow from A to L did not exist, then L would be a non-descendant of A that blocks all the backdoor paths. 
Analogously, on the SWIG in Figure 7.12, we can replace $L^a$ by L as A is no longer a cause of L (note $Y^a$ and A are 
now d-separated by L). Therefore adjusting for L would eliminate all bias, even if L were still in the future of A. What 
matters is the topology of the causal diagram (which variables cause which variables), not the time sequence of the 
nodes. Rosenbaum (1984) and Robins (1986, section 11) give non-graphical discussions of the control of confounding 
by temporally post-treatment variables. 


In the absence of randomization, causal inference relies on the uncheckable 
assumption that we have measured a set of variables L that is a sufficient 
set for confounding adjustment, that is, a set of non-descendants of treatment 
A that includes enough variables to block all backdoor paths from A to Y. 
Under this assumption of conditional exchangeability given L, standardization 
and IP weighting can be used to compute the average causal effect in the 
population. But, as discussed in Section 4.6, standardization and IP weighting 
are not the only available methods to adjust for confounding in observational 
studies. Methods that adjust for confounders L can be classified into two broad 
categories: 


• G-methods: standardization, IP weighting, and g-estimation. These 
methods (the ‘g’ stands for ‘generalized’) exploit conditional exchangeability 
given L to estimate the causal effect of A on Y in the entire 
population or in any subset of the population. In our heart transplant 
study, we used g-methods to adjust for confounding by disease severity 
L in Sections 2.4 (standardization) and 2.5 (IP weighting). Part II 
describes model-based extensions of g-methods: the parametric g-formula 
(standardization), IP weighting of marginal structural models, and 
g-estimation of nested structural models. 


• Stratification-based methods: stratification (including restriction) and 
matching. These methods exploit conditional exchangeability given L to 
estimate the association between A and Y in subsets defined by L. In our 
heart transplant study, we used stratification-based methods to adjust for 
confounding by disease severity L in Sections 4.4 (stratification) and 4.5 
(matching). Part II describes the model-based extension of stratification: 
conventional outcome regression. 


Figure 7.12 


A common variation of stratification and matching replaces each individual's variables L by the 
individual's estimated probability of receiving treatment Pr[A = 1|L]: the propensity score (Rosenbaum 
and Rubin 1983). See Chapter 15. 

Technically, g-estimation requires the slightly weaker assumption that the magnitude of unmeasured 
confounding given L is known, of which the assumption of no unmeasured confounding is a particular case. 
See Chapter 14. 

A practical example of the application of expert knowledge of the causal structure to confounding 
evaluation was described by Hernán et al. (2002). 


G-methods simulate the A-Y association that would be observed in the population if backdoor 
paths involving the measured variables L did not exist. For example, IP 
weighting achieves this by creating a pseudo-population in which treatment 
A is independent of the measured confounders L, that is, by “deleting” the 
arrow from L to A. In contrast, stratification-based methods do not delete the 
arrow from L to A but rather compute the conditional effect in a subset of the 
observed population, which is represented by adding a selection box. The 
advantage of “deleting” the arrow from confounders L to treatment A will become 
apparent when we discuss time-varying treatments in Part III. In settings with 
time-varying treatments, and therefore time-varying confounders, g-methods 
are the methods of choice to adjust for confounding because stratification-based 
methods may result in selection bias. The bias of stratification-based methods 
is described in Chapter 20. 
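
As a rough illustration of the contrast between the two families of methods, the sketch below computes a standardized risk, an IP weighted risk, and stratum-specific risk differences from a small table. The counts are hypothetical and chosen only so that the arithmetic is exact, and all function names are made up for this sketch.

```python
from fractions import Fraction

# Hypothetical counts: (L, A) -> (number of individuals, number with Y = 1).
counts = {
    (0, 0): (4, 1), (0, 1): (4, 1),
    (1, 0): (3, 2), (1, 1): (9, 6),
}
total = sum(n for n, _ in counts.values())

def n_in_stratum(l):
    """Number of individuals with L = l, regardless of treatment."""
    return counts[(l, 0)][0] + counts[(l, 1)][0]

def standardized_risk(a):
    """Average of stratum-specific risks, weighted by Pr[L = l]."""
    risk = Fraction(0)
    for l in (0, 1):
        n, events = counts[(l, a)]
        risk += Fraction(events, n) * Fraction(n_in_stratum(l), total)
    return risk

def ip_weighted_risk(a):
    """Risk in the pseudo-population with weights 1 / Pr[A = a | L = l]."""
    weighted_events = weighted_n = Fraction(0)
    for l in (0, 1):
        n, events = counts[(l, a)]
        weight = Fraction(n_in_stratum(l), n)  # 1 / Pr[A = a | L = l]
        weighted_events += weight * events
        weighted_n += weight * n
    return weighted_events / weighted_n

def stratum_risk_difference(l):
    """Conditional A-Y association within the subset L = l."""
    (n0, e0), (n1, e1) = counts[(l, 0)], counts[(l, 1)]
    return Fraction(e1, n1) - Fraction(e0, n0)

for a in (0, 1):
    print(a, standardized_risk(a), ip_weighted_risk(a))
print([stratum_risk_difference(l) for l in (0, 1)])
```

With these counts the two g-methods return the same population risk for each treatment level, while stratification returns one conditional contrast per level of L rather than a single population quantity.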


All the above methods require conditional exchangeability given L. How- 
ever, confounding can sometimes be handled by methods that do not require 
conditional exchangeability. Some examples of these methods are difference- 
in-differences (Technical Point 7.3), instrumental variable estimation (Chapter 
16), the front door criterion (Technical Point 7.4), and others. Unfortunately, 
these methods require alternative assumptions that, like conditional exchange- 
ability, are unverifiable. Therefore, in practice, the validity of the resulting 
effect estimates is not guaranteed. Also, these methods cannot be generally 
employed for causal questions involving time-varying treatments. As a result, 
these methods are disqualified from consideration for many research problems. 
For time-fixed treatment, the choice of adjustment method will depend on 
which unverifiable assumptions—either conditional exchangeability or the al- 
ternative conditions—are believed more likely to hold in a particular setting. 


Achieving conditional exchangeability may be an unrealistic goal in many 
observational studies but, as discussed in Section 3.2, expert knowledge about 
the causal structure can be used to get as close as possible to that goal. There- 
fore, in observational studies, investigators measure many variables L (which 
are non-descendants of treatment) in an attempt to ensure that the treated and 
the untreated are conditionally exchangeable. The hope is that, even though 
common causes may exist (confounding), the measured variables L are suf- 
ficient to block all backdoor paths (no unmeasured confounding). However, 
there is no guarantee that this attempt will be successful, which makes causal 
inference from observational data a risky undertaking. 


In addition, expert knowledge can be used to avoid adjusting for variables 
that may introduce bias. At the very least, investigators should generally 
avoid adjustment for variables affected by either the treatment or the outcome. 
Of course, thoughtful and knowledgeable investigators could believe that two 
or more causal structures, possibly leading to different conclusions regarding 
confounding and confounders, are equally plausible. In that case they would 
perform multiple analyses and explicitly state the assumptions about causal 
structure required for the validity of each. Unfortunately, one can never be 
certain that the set of causal structures under consideration includes the true 
one; this uncertainty is unavoidable with observational data. 


There is a scientific consequence to the always present threat of confound- 
ing in observational studies. Suppose you conducted an observational study 
to identify the effect of heart transplant A on death Y and that you assumed 
no unmeasured confounding given disease severity L. A critic of your study 
says “the inferences from this observational study may be incorrect because 
of potential confounding.” The critic is not making a scientific statement, but 
a logical one. Since the findings from any observational study may be con- 
founded, it is obviously true that those of your study can be confounded. If 
the critic’s intent was to provide evidence about the shortcomings of your 
particular study, he failed. His criticism is noninformative because he sim- 
ply restated a characteristic of observational research that you and the critic 
already knew before the study was conducted. 
Technical Point 7.3 


Difference-in-differences and negative outcome controls. Suppose we want to compute the average causal effect 
of aspirin A (1: yes; 0: no) on blood pressure Y, but there are unmeasured common causes U of A and Y such 
as history of heart disease. Then we cannot compute the effect via standardization or IP weighting because there is 
unmeasured confounding. But there is an alternative method that, under some conditions, may adjust for the unmeasured 
confounding: the use of negative outcome controls (also known as “placebo tests”). 

Suppose further that, for each individual in the population, we have also measured the value of the outcome right 
before treatment was available in the population. We refer to this pre-treatment outcome C as a negative outcome 
control. As depicted in Figure 7.13, U is a cause of both Y and C and treatment A is obviously not a cause of 
the pre-treatment outcome C. Now, even though the causal effect of A on C is known to be zero, the contrast 
$E[C \mid A = 1] - E[C \mid A = 0]$ is not zero because of confounding by U. In fact, $E[C \mid A = 1] - E[C \mid A = 0]$ measures the 
magnitude of confounding for the effect of A on C on the additive scale. If the magnitude of additive confounding for 
the effect of A on the negative outcome control C is the same as for the effect of A on the true outcome Y, then 
we can compute the effect of A on Y in the treated. Specifically, under the assumption of additive equi-confounding 
$E[Y^0 \mid A = 1] - E[Y^0 \mid A = 0] = E[C \mid A = 1] - E[C \mid A = 0]$, the effect is 


$$E[Y^1 - Y^0 \mid A = 1] = \left(E[Y \mid A = 1] - E[Y \mid A = 0]\right) - \left(E[C \mid A = 1] - E[C \mid A = 0]\right)$$


That is, the effect in the treated is equal to the association between treatment A and outcome Y (which is a mixture 
of the causal effect and confounding) minus the confounding as measured by the association between treatment A and 
the negative outcome control C. 

This method for confounding adjustment is known as difference-in-differences (Card 1990, Meyer et al. 1995, 
Angrist and Krueger 1999). In practice, the method is often combined with adjustment for measured covariates using 
parametric or semiparametric approaches (Abadie 2005). However, as explained by Sofer et al. (2016), the difference- 
in-differences method is a somewhat restrictive approach for using negative outcome controls: it requires measurement 
of the outcome both pre- and post-treatment (or at least that the true outcome Y and the C are measured on the same 
scale) and it requires additive equi-confounding. Sofer et al. (2016) describe more general methods that allow for Y 
and C to be on different scales, rely on weaker versions of equi-confounding, and incorporate adjustment for measured 
covariates. For a general introduction to the use of negative outcome controls to detect confounding, see Lipsitch et al. 
(2010) and Flanders et al. (2011). 
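
The difference-in-differences calculation can be tried on simulated data. The sketch below is a hypothetical data-generating model, not an analysis from the text: U shifts both C and Y by the same additive amount, so additive equi-confounding holds by construction, and the true effect of A on Y is set to -1 for everyone.

```python
import random

def simulate(n=100_000, effect=-1.0, seed=0):
    """Simulate (A, C, Y) with an unmeasured common cause U (assumed model)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                      # unmeasured U
        a = rng.random() < (0.8 if u else 0.2)      # U affects treatment
        c = 2.0 * u + rng.gauss(0, 1)               # pre-treatment outcome
        y = 2.0 * u + effect * a + rng.gauss(0, 1)  # same additive effect of U
        rows.append((a, c, y))
    return rows

def mean_difference(rows, column):
    """E[column | A = 1] - E[column | A = 0]."""
    treated = [r[column] for r in rows if r[0]]
    untreated = [r[column] for r in rows if not r[0]]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

rows = simulate()
crude = mean_difference(rows, 2)         # association between A and Y
confounding = mean_difference(rows, 1)   # association between A and C
did = crude - confounding                # difference-in-differences
print(round(crude, 2), round(confounding, 2), round(did, 2))
```

Under this model the crude contrast should be close to 0.2 and the A-C contrast close to 1.2, so subtracting one from the other recovers a value near the assumed effect of -1. If U affected C and Y by different amounts, the subtraction would no longer remove the confounding.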


To appropriately criticize your study, the critic needs to engage in a truly 
scientific conversation. For example, the critic may cite experimental or 
observational findings that contradict your findings, or he can say something along 
the lines of “the inferences from this observational study may be incorrect 
because of potential confounding due to cigarette smoking, a common cause 
through which a backdoor path may remain open”. This latter option provides 
you with a testable challenge to your assumption of no unmeasured confounding. 
The burden of the proof is again yours. Your next move is to try and 
adjust for smoking. 

Though the above discussion was restricted to bias due to confounding, the 
absence of biases due to selection and measurement is also needed for valid 
causal inference from observational data. But, unlike confounding, these other 
biases may arise in both randomized experiments and observational studies. 
After having explored confounding in this chapter, the next chapter presents 
another potential source of lack of exchangeability between the treated and the 
untreated: selection of individuals into the analysis. 


Figure 7.13 


Figure 7.14 
Technical Point 7.4 


The front door criterion. The causal diagram in Figure 7.14 depicts a setting in which the treatment A and the 
binary outcome Y share an unmeasured cause U, and in which there is a variable M that fully mediates the effect of 
A on Y and that shares no unmeasured causes with either A or Y. Under this causal structure, a data analyst cannot 
directly use standardization (nor IP weighting) to compute the counterfactual risks $\Pr[Y^{a=1} = 1]$ and $\Pr[Y^{a=0} = 1]$ 
because the variable U, which is necessary to block the backdoor path between A and Y, is not available. Therefore, 
the average causal effect of A on Y cannot be identified using the methods described in previous chapters. However, 
Pearl (1995) showed that $\Pr[Y^a = 1]$ is identified by the so-called front door formula 

$$\sum_m \Pr[M = m \mid A = a] \sum_{a'} \Pr[Y = 1 \mid M = m, A = a'] \Pr[A = a']$$

Pearl refers to this identification formula as front door adjustment because it relies on the existence of a path 
from A to Y that, contrary to a backdoor path, goes through a descendant M of A that completely mediates the 
effect of A on Y. Pearl often uses the term backdoor formula to refer to the identification formula that we refer to as 
standardization or, more generally, the g-formula (Robins 1986). 

A proof of the front door identification formula follows. Note that 
$\Pr[Y^a = 1] = \sum_m \Pr[M^a = m] \Pr[Y^a = 1 \mid M^a = m]$ and that, under Figure 7.14, $\Pr[M^a = m] = \Pr[M = m \mid A = a]$ 
because there is no confounding for the effect of A on M (i.e., $A \perp\!\!\!\perp M^a$), and 
$\Pr[Y^a = 1 \mid M^a = m] = \sum_{a'} \Pr[Y = 1 \mid M = m, A = a'] \Pr[A = a']$. To prove the last equality, first note that 
$\Pr[Y^a = 1 \mid M^a = m] = \Pr[Y^m = 1]$ because (i) $Y^a = Y^m$ when $M^a = m$ (A affects Y only through M in Figure 7.14) and 
(ii) $Y^m \perp\!\!\!\perp M^a$ by d-separation on a SWIG under the joint intervention in which M is set to m and A 
to a. Finally, by conditional exchangeability $Y^m \perp\!\!\!\perp M \mid A$ on the SWIG where we intervene on M alone, 
$\Pr[Y^m = 1] = \sum_{a'} \Pr[Y = 1 \mid M = m, A = a'] \Pr[A = a']$. 

The above proof requires well-defined counterfactual outcomes $Y^m$ under interventions on M. We now provide a 
second proof in which we assume that only counterfactual outcomes $Y^a$ under interventions on A are well-defined. To 
do so, we reinterpret the causal DAG in Figure 7.14 as a statistical DAG and use the SWIG independence $D^a \perp\!\!\!\perp A \mid N$, 
where D = (Y, M) and N = U are the descendants and non-descendants of A, respectively. Then 

$$
\begin{aligned}
\Pr[Y^a = y] &= \sum_m \sum_u \Pr[Y^a = y, M^a = m, U = u] \\
&= \sum_m \sum_u \Pr[Y^a = y, M^a = m \mid A = a, U = u] \Pr[U = u] \quad \text{(by exchangeability)} \\
&= \sum_m \sum_u \Pr[Y = y, M = m \mid A = a, U = u] \Pr[U = u] \quad \text{(by consistency)} \\
&= \sum_m \sum_u \Pr[Y = y \mid M = m, A = a, U = u] \Pr[M = m \mid A = a, U = u] \Pr[U = u] \\
&= \sum_m \Pr[M = m \mid A = a] \sum_u \Pr[Y = y \mid M = m, U = u] \Big\{ \sum_{a'} \Pr[U = u \mid A = a'] \Pr[A = a'] \Big\} \\
&\qquad \text{(by } U \perp\!\!\!\perp M \mid A \text{ and } A \perp\!\!\!\perp Y \mid M, U \text{)} \\
&= \sum_m \Pr[M = m \mid A = a] \sum_{a'} \Big\{ \sum_u \Pr[Y = y \mid M = m, A = a', U = u] \Pr[U = u \mid M = m, A = a'] \Big\} \Pr[A = a'] \\
&\qquad \text{(by } U \perp\!\!\!\perp M \mid A \text{ and } A \perp\!\!\!\perp Y \mid M, U \text{)} \\
&= \sum_m \Pr[M = m \mid A = a] \sum_{a'} \Pr[Y = y \mid M = m, A = a'] \Pr[A = a'].
\end{aligned}
$$
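
The front door formula can be checked numerically on a toy model. In the sketch below every probability is a made-up assumption chosen to respect the structure of Figure 7.14 as described above (U affects A and Y, A affects M, M and U affect Y). The function `front_door` uses only the distribution of (A, M, Y), while `truth` uses the unmeasured U.

```python
from itertools import product

# Hypothetical structural model (all numbers are assumptions).
p_u = {0: 0.6, 1: 0.4}                      # Pr[U = u]
p_a1_given_u = {0: 0.3, 1: 0.8}             # Pr[A = 1 | U = u]
p_m1_given_a = {0: 0.2, 1: 0.7}             # Pr[M = 1 | A = a]
p_y1_given_mu = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.3, (1, 1): 0.9}  # Pr[Y = 1 | M = m, U = u]

def bern(p_one, value):
    """Probability that a binary variable with Pr[1] = p_one equals value."""
    return p_one if value == 1 else 1 - p_one

# Observed-data distribution Pr[A = a, M = m, Y = y], with U summed out.
observed = {}
for u, a, m, y in product((0, 1), repeat=4):
    p = (p_u[u] * bern(p_a1_given_u[u], a) * bern(p_m1_given_a[a], m)
         * bern(p_y1_given_mu[(m, u)], y))
    observed[(a, m, y)] = observed.get((a, m, y), 0.0) + p

def pr(**fixed):
    """Probability of an event such as pr(a=1, m=0) under `observed`."""
    return sum(p for (a, m, y), p in observed.items()
               if all({"a": a, "m": m, "y": y}[k] == v
                      for k, v in fixed.items()))

def front_door(a):
    """Front door formula, using only the distribution of (A, M, Y)."""
    total = 0.0
    for m in (0, 1):
        p_m = pr(a=a, m=m) / pr(a=a)
        inner = sum(pr(a=a2, m=m, y=1) / pr(a=a2, m=m) * pr(a=a2)
                    for a2 in (0, 1))
        total += p_m * inner
    return total

def truth(a):
    """Pr[Y^a = 1] computed from the structural model, which uses U."""
    return sum(bern(p_m1_given_a[a], m) * p_y1_given_mu[(m, u)] * p_u[u]
               for m, u in product((0, 1), repeat=2))

for a in (0, 1):
    print(a, round(front_door(a), 6), round(truth(a), 6))
```

The two columns agree for both treatment levels, as the proofs above imply. Adding a direct effect of U on M or of A on Y to the toy model would break the agreement.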


Chapter 8 
SELECTION BIAS 


Suppose an investigator conducted a randomized experiment to answer the causal question “does one’s looking 
up to the sky make other pedestrians look up too?” She found a strong association between her looking up and 
other pedestrians’ looking up. Does this association reflect a causal effect? Well, by definition of randomized 
experiment, confounding bias is not expected in this study. However, there was another potential problem: The 
analysis included only those pedestrians that, after having been part of the experiment, gave consent for their data 
to be used. Shy pedestrians (those less likely to look up anyway) and pedestrians in front of whom the investigator 
looked up (who felt tricked) were less likely to participate. Thus participating individuals in front of whom the 
investigator looked up (a reason to decline participation) are less likely to be shy (an additional reason to decline 
participation) and therefore more likely to look up. That is, the process of selection of individuals into the analysis 
guarantees that one’s looking up is associated with other pedestrians’ looking up, regardless of whether one’s 
looking up actually makes others look up. 

An association created as a result of the process by which individuals are selected into the analysis is referred to 
as selection bias. Unlike confounding, this type of bias is not due to the presence of common causes of treatment and 
outcome, and can arise in both randomized experiments and observational studies. Like confounding, selection 
bias is just a form of lack of exchangeability between the treated and the untreated. This chapter provides a 
definition of selection bias and reviews the methods to adjust for it. 


8.1 The structure of selection bias 


The term “selection bias” encompasses various biases that arise from the 
procedure by which individuals are selected into the analysis. Here we focus on 
bias that would arise even if the treatment had a null effect on the outcome, 
that is, selection bias under the null (as described in Section 6.5). The 
structure of selection bias can be represented by using causal diagrams like the one 
in Figure 8.1, which depicts dichotomous treatment A, outcome Y, and their 
common effect C. Suppose Figure 8.1 represents a study to estimate the effect 
of folic acid supplements A given to pregnant women shortly after conception 
on the fetus’s risk of developing a cardiac malformation Y (1: yes, 0: no) 
during the first two months of pregnancy. The variable C represents death before 
birth. A cardiac malformation increases mortality (arrow from Y to C), and 
folic acid supplementation decreases mortality by reducing the risk of 
malformations other than cardiac ones (arrow from A to C). The study was restricted 
to fetuses who survived until birth. That is, the study was conditioned on no 
death C = 0 and hence the box around the node C. 


Figure 8.1 


Pearl (1995) and Spirtes et al. (2000) used causal diagrams to describe the structure of bias resulting 
from selection of individuals. 


The diagram in Figure 8.1 shows two sources of association between 
treatment and outcome: 1) the open path $A \rightarrow Y$ that represents the causal effect 
of A on Y, and 2) the open path $A \rightarrow C \leftarrow Y$ that links A and Y through 
their (conditioned on) common effect C. An analysis conditioned on C will 
generally result in an association between A and Y. We refer to this induced 
association between the treatment A and the outcome Y as selection bias due 
to conditioning on C. Because of selection bias, the associational risk ratio 
$\Pr[Y = 1 \mid A = 1, C = 0] / \Pr[Y = 1 \mid A = 0, C = 0]$ does not equal the causal 
risk ratio $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$; association is not causation. If the 
analysis were not conditioned on the common effect (collider) C, then the only 
open path between treatment and outcome would be $A \rightarrow Y$, and thus the 
entire association between A and Y would be due to the causal effect of A on 
Y. That is, the associational risk ratio $\Pr[Y = 1 \mid A = 1] / \Pr[Y = 1 \mid A = 0]$ 
would equal the causal risk ratio $\Pr[Y^{a=1} = 1] / \Pr[Y^{a=0} = 1]$; association 
would be causation. 
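
A small exact calculation can make the role of conditioning on C concrete. The sketch below assumes a version of the structure in which A has no effect on Y (the null) while both A and Y affect C. All probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical probabilities; A has no effect on Y (the null holds).
p_a1 = 0.5                                    # Pr[A = 1]
p_y1 = 0.1                                    # Pr[Y = 1], same for every A
p_c1_given_ay = {(0, 0): 0.20, (0, 1): 0.60,
                 (1, 0): 0.10, (1, 1): 0.40}  # Pr[C = 1 | A = a, Y = y]

def bern(p_one, value):
    """Probability that a binary variable with Pr[1] = p_one equals value."""
    return p_one if value == 1 else 1 - p_one

# Joint distribution Pr[A = a, Y = y, C = c] implied by the assumed model.
joint = {(a, y, c): bern(p_a1, a) * bern(p_y1, y)
         * bern(p_c1_given_ay[(a, y)], c)
         for a, y, c in product((0, 1), repeat=3)}

def risk(a, c=None):
    """Pr[Y = 1 | A = a], optionally also restricted to C = c."""
    keep = [(k, p) for k, p in joint.items()
            if k[0] == a and (c is None or k[2] == c)]
    return sum(p for k, p in keep if k[1] == 1) / sum(p for _, p in keep)

rr_all = risk(1) / risk(0)                    # no conditioning on C
rr_survivors = risk(1, c=0) / risk(0, c=0)    # restricted to C = 0
print(round(rr_all, 3), round(rr_survivors, 3))
```

In the full population the associational risk ratio equals the causal risk ratio of 1, whereas among those with C = 0 it is about 1.31 even though the treatment does nothing to the outcome in this toy model.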

The causal diagram in Figure 8.2 shows another example of selection bias. 
This diagram includes all variables in Figure 8.1 plus a node S representing 
parental grief (1: yes, 0: no), which is affected by vital status at birth. Suppose 
the study was restricted to nongrieving parents S = 0 because the others were 
unwilling to participate. As discussed in Chapter 6, conditioning on a variable 
S affected by the collider C also opens the path $A \rightarrow C \leftarrow Y$. 

Both Figures 8.1 and 8.2 depict examples of selection bias in which the bias 
arises because of conditioning on a common effect of treatment and outcome: 
C in Figure 8.1 and S in Figure 8.2. This bias arises regardless of whether there 
is an arrow from A to Y, that is, it is selection bias under the null. Remember 
that causal structures that result in bias under the null also cause bias when 
the treatment has a non-null effect. Both confounding due to common causes 
of treatment and outcome (see previous chapter) and selection bias due to 
conditioning on common effects of treatment and outcome are examples of 
bias under the null. However, selection bias under the null can be defined 
more generally as illustrated by Figures 8.3 to 8.6. 

Consider the causal diagram in Figure 8.3, which represents a follow-up 
study of HIV-positive individuals to estimate the effect of certain antiretroviral 
treatment A on the 3-year risk of death Y (to reduce clutter, there is no 
arrow from A to Y). The unmeasured variable U represents high level of 
immunosuppression (1: yes, 0: no). Individuals with U = 1 have a greater risk 
of death. Individuals who drop out from the study or are otherwise lost to 
follow-up are censored (C = 1). Individuals with U = 1 are more likely to be 
censored because the severity of their disease prevents them from participating 
in the study. The effect of U on censoring C is mediated by the presence of 
symptoms (fever, weight loss, diarrhea, and so on), CD4 count, and viral load 
in plasma, all included in L, which could or could not be measured. (The 
role of L, when measured, in data analysis is discussed in Section 8.5; in this 
section, we take L to be unmeasured.) Individuals receiving treatment are at a 
greater risk of experiencing side effects, which could lead them to drop out, as 
represented by the arrow from A to C. The square around C indicates that the 
analysis is restricted to individuals who remained uncensored (C = 0) because 
those are the only ones in which Y can be assessed. 

According to the rules of d-separation, conditioning on the collider C opens 
the path $A \rightarrow C \leftarrow L \leftarrow U \rightarrow Y$ and thus association flows from treatment A 
to outcome Y, i.e., the associational risk ratio is not equal to 1 even though 
the causal risk ratio is equal to 1. Figure 8.3 can be viewed as a simple 
transformation of Figure 8.1: the association between Y and C resulting from 
a direct effect of Y on C in Figure 8.1 is now the result of U, a common 
cause of Y and C. Some intuition for this bias: If a treated individual with 
treatment-induced side effects (and thereby at a greater risk of dropping out) 
did in fact not drop out (C = 0), then it is generally less likely that a second 
independent cause of dropping out (e.g., U = 1) was present. Therefore, an 
inverse association between A and U would be expected in those who did 
not drop out (C = 0). Because U is positively associated with the outcome 
Y, restricting the analysis to individuals who did not drop out of this study 
induces an inverse association between A and Y. 

Figures 8.5 and 8.6 show examples of M-bias. 

More generally, selection bias can be defined as the bias resulting from 
conditioning on the common effect of two variables, one of which 
is either the treatment or associated with the treatment, and the 
other is either the outcome or associated with the outcome (Hernán, 
Hernández-Díaz, and Robins 2004). 

The bias in Figure 8.3 is an example of selection bias that results from 
conditioning on the censoring variable C, which is a common effect of treat- 
ment A and a cause U of the outcome Y, rather than of the outcome itself. 
We now present three additional causal diagrams that could lead to selection 
bias by differential loss to follow up. In Figure 8.4 prior treatment A has a 
direct effect on symptoms L. Restricting the study to the uncensored individ- 
uals again implies conditioning on the common effect C of A and U, thereby 
introducing an association between treatment and outcome. Figures 8.5 and 
8.6 are variations of Figures 8.3 and 8.4, respectively, in which there is a common 
cause W of A and another measured variable. W indicates unmeasured 
lifestyle/personality/educational variables that determine both treatment (arrow 
from W to A) and either attitudes toward attending study visits (arrow 
from W to C in Figure 8.5) or threshold for reporting symptoms (arrow from 
W to L in Figure 8.6). 

We have described some different causal structures, depicted in Figures 8.1- 
8.6, that may lead to selection bias. In all these cases, the bias is the result 
of selection on a common effect of two other variables in the diagram, i.e., a 
collider. We will use the term selection bias to refer to all biases that arise 
from conditioning on a common effect of two variables, one of which is either 
the treatment or a cause of treatment, and the other is either the outcome or 
a cause of the outcome. We now describe some examples of selection bias that 
share this structure. 


8.2 Examples of selection bias 


The distinction between the two structures leading to lack of exchangeability 
is not universally made across disciplines. Lack of conditional exchangeability 
due to any cause is often referred to as “weak ignorability” or “ignorable 
treatment assignment” in statistics (Rosenbaum and Rubin, 1983), “selection on 
observables” in the social sciences (Barnow et al., 1980), and “omitted variable 
bias” or “endogeneity” in econometrics (Imbens, 2004). 


Consider the following examples of bias due to the mechanism by which indi- 
viduals are selected into the analysis: 


• Differential loss to follow-up: This is precisely the bias described in the 
previous section and summarized in Figures 8.3-8.6. It is also referred to 
as bias due to informative censoring. 


• Missing data bias, nonresponse bias: The variable C in Figures 8.3-8.6 
can represent missing data on the outcome for any reason, not just as a 
result of loss to follow up. For example, individuals could have missing 
data because they are reluctant to provide information or because they 
miss study visits. Regardless of the reasons why data on Y are missing, 
restricting the analysis to individuals with complete data (C = 0) may 
result in bias. 


• Healthy worker bias: Figures 8.3-8.6 can also describe a bias that could 
arise when estimating the effect of an occupational exposure A (e.g., a 
chemical) on mortality Y in a cohort of factory workers. The underlying 
unmeasured true health status U is a determinant of both death Y and 
of being at work C (1: no, 0: yes). The study is restricted to individuals 
who are at work (C = 0) at the time of outcome ascertainment. (L 
could be the result of blood tests and a physical examination.) Being 
exposed to the chemical reduces the probability of being at work in the 
near future, either directly (e.g., exposure can cause disabling asthma), 
like in Figures 8.3 and 8.4, or through a common cause W (e.g., certain 
Fine Point 8.1 


Selection bias in case-control studies. Figure 8.1 can be used to represent selection bias in a case-control study. 
Suppose a certain investigator wants to estimate the effect of postmenopausal estrogen treatment A on coronary heart 
disease Y. The variable C indicates whether a woman in the study population (the underlying cohort, in epidemiologic 
terms) is selected for the case-control study (1: no, 0: yes). The arrow from disease status Y to selection C indicates 
that cases in the population are more likely to be selected than noncases, which is the defining feature of a case-control 
study. In this particular case-control study, the investigator decided to select controls (Y = 0) preferentially among 
women with a hip fracture. Because treatment A has a protective causal effect on hip fracture, the selection of controls 
with hip fracture implies that treatment A now has a causal effect on selection C. This effect of A on C is represented 
by the arrow A → C. One could add an intermediate node F (representing hip fracture) between A and C, but that is 
unnecessary for our purposes. 

In a case-control study, the association measure (the treatment-outcome odds ratio) is by definition conditional 
on having been selected into the study (C = 0). If individuals with hip fracture are oversampled as controls, then the 
probability of control selection depends on a consequence of treatment A (as represented by the path from A to C) and 
“inappropriate control selection” bias will occur. Again, this bias arises because we are conditioning on a common effect 
C of treatment and outcome. A heuristic explanation of this bias follows. Among individuals selected for the study 
(C = 0), controls are more likely than cases to have had a hip fracture. Therefore, because estrogens lower the incidence 
of hip fractures, a control is less likely to be on estrogens than a case, and hence the A-Y odds ratio conditional on 
C = 0 would be greater than the causal odds ratio in the population. Other forms of selection bias in case-control 
studies, including some biases described by Berkson (1946) and incidence-prevalence bias, can also be represented by 
Figure 8.1 or modifications of it, as discussed by Hernán, Hernández-Díaz, and Robins (2004). 





exposed jobs are eliminated for economic reasons and the workers laid 
off) like in Figures 8.5 and 8.6. 


• Self-selection bias, volunteer bias: Figures 8.3-8.6 can also represent a 
study in which C is agreement to participate (1: no, 0: yes), A is cigarette 
smoking, Y is coronary heart disease, U is family history of heart disease, 
and W is healthy lifestyle. (L is any mediator between U and C such as 
heart disease awareness.) Under any of these structures, selection bias 
may be present if the study is restricted to those who volunteered or 
elected to participate (C = 0). 

Berkson (1955) described the structure of bias due to self-selection. 


• Selection affected by treatment received before study entry: Suppose that 
C in Figures 8.3-8.6 represents selection into the study (1: no, 0: yes) 
and that treatment A took place before the study started. If treatment 
affects the probability of being selected into the study, then selection 
bias is expected. The case of selection bias arising from the effect of 
treatment on selection into the study can be viewed as a generalization 
of self-selection bias. This bias may be present in any study that attempts 
to estimate the causal effect of a treatment that occurred before 
the study started or in which treatment includes a pre-study component. 
For example, selection bias may arise when treatment is measured as the 
lifetime exposure to a certain factor (medical treatment, lifestyle behavior...) 
in a study that recruited 50-year-old participants. In addition to 
selection bias, it is also possible that there exists unmeasured confounding 
for the pre-study component of treatment if confounders were only 
measured during the study. 

Robins, Hernán, and Rotnitzky (2007) used causal diagrams to describe 
the structure of bias due to the effect of pre-study treatments on 
selection into the study. 


In addition to the biases described here, as well as in Fine Point 8.1 and 
Technical Point 8.1, causal diagrams have been used to characterize various 




For example, selection bias may be 
induced by attempts to eliminate 
ascertainment bias (Robins 2001), 
to estimate direct effects (Cole and 
Hernán 2002), and by conventional 
adjustment for variables affected by 
previous treatment (see Part III). 
other biases that arise from conditioning on a common effect. These examples 
show that selection bias may occur in retrospective studies—those in which data 
on treatment A are collected after the outcome Y occurs—and in prospective 
studies—those in which data on treatment A are collected before the outcome 
Y occurs. Further, these examples show that selection bias may occur both in 
observational studies and in randomized experiments. 

Take Figures 8.3 and 8.4, which could depict either an observational study 
or an experiment in which treatment A is randomly assigned, because there are 
no common causes of A and any other variable. Individuals in both randomized 
experiments and observational studies may be lost to follow-up or drop out of 
the study before their outcome is ascertained. When this happens, the risk 
Pr[Y = 1|A = a] cannot be computed because the value of the outcome Y is 
unknown for the censored individuals (C = 1). Therefore only the risk among 
the uncensored Pr[Y = 1|A = a,C = 0] can be computed. This restriction of 
the analysis to the uncensored individuals may induce selection bias because 
uncensored individuals who remained through the end of the study (C = 0) 
may not be exchangeable with individuals that were lost (C = 1). 

Hence a key difference between confounding and selection bias: random- 
ization protects against confounding, but not against selection bias when the 
selection occurs after the randomization. On the other hand, no bias arises 
in randomized experiments from selection into the study before treatment is 
assigned. For example, only volunteers who agree to participate are enrolled 
in randomized clinical trials, but such trials are not affected by volunteer bias 
because participants are randomly assigned to treatment only after agreeing to 
participate (C = 0). Thus none of Figures 8.3-8.6 can represent volunteer bias 
in a randomized trial. Figures 8.3 and 8.4 are eliminated because treatment 
cannot cause agreement to participate C. Figures 8.5 and 8.6 are eliminated 
because, as a result of the random treatment assignment, there cannot exist a 
common cause of treatment and any other variable. 


8.3 Selection bias and confounding 
Figure 8.7 


For the same reason, social scientists often refer to unmeasured 
confounding as selection on unobservables. 


In this and the previous chapter, we describe two reasons why the treated and 
the untreated may not be exchangeable: 1) the presence of common causes of 
treatment and outcome, and 2) conditioning on common effects of treatment 
and outcome (or causes of them). We refer to biases due to the presence of 
common causes as “confounding” and to those due to conditioning on common 
effects as “selection bias.” This structural definition provides a clear-cut clas- 
sification of confounding and selection bias, even though it might not coincide 
perfectly with the traditional terminology of some disciplines. For example, 
statisticians and econometricians often use the term “selection bias” to refer 
to both types of biases. Their rationale is that in both cases the bias is due 
to selection: selection of individuals into the analysis (the structural “selection 
bias”) or selection of individuals into a treatment (the structural “confound- 
ing”). Our goal, however, is not to be normative about terminology, but rather 
to emphasize that, regardless of the particular terms chosen, there are two dis- 
tinct causal structures that lead to bias. 

The end result of both structures is lack of exchangeability between the 
treated and the untreated—which implies that these two biases occur even 
under the null. For example, consider a study restricted to firefighters that 
aims to estimate the causal effect of being physically active A on the risk 
Technical Point 8.1 


The built-in selection bias of hazard ratios. The causal DAG in Figure 8.8 describes a randomized experiment of the 
effect of heart transplant A on death at times 1 (Y1) and 2 (Y2). The arrow from A to Y1 represents that transplant 
decreases the risk of death at time 1. The lack of an arrow from A to Y2 indicates that A has no direct effect on death 
at time 2. That is, heart transplant does not influence the survival status at time 2 of any individual who would survive 
past time 1 when untreated (and thus when treated). U is an unmeasured haplotype that decreases the individual's risk 
of death at all times. Because of the absence of confounding, the associational risk ratios 
aRR_{AY1} = Pr[Y1 = 1|A = 1]/Pr[Y1 = 1|A = 0] and aRR_{AY2} = Pr[Y2 = 1|A = 1]/Pr[Y2 = 1|A = 0] 
are unbiased measures of the effect of A on death at times 1 and 2, respectively. Even though 
A has no direct effect on Y2, aRR_{AY2} will be less than 1 because it is a measure of the effect of A on total mortality 
through time 2. 

Consider now the time-specific hazard ratio (which, for all practical purposes, is equivalent to the rate ratio). In 
discrete time, the hazard of death at time 1 is the probability of dying at time 1 and thus the associational hazard ratio 
is the same as aRR_{AY1}. However, the hazard at time 2 is the probability of dying at time 2 among those who survived 
past time 1. Thus, the associational hazard ratio at time 2 is then 
aRR_{AY2|Y1=0} = Pr[Y2 = 1|A = 1, Y1 = 0]/Pr[Y2 = 1|A = 0, Y1 = 0]. The square 
around Y1 in Figure 8.8 indicates this conditioning. Treated survivors of time 1 are less likely than untreated survivors of 
time 1 to have the protective haplotype U (because treatment can explain their survival) and therefore are more likely 
to die at time 2. That is, conditional on Y1, treatment A is associated with a higher mortality at time 2. Thus, the 
hazard ratio at time 1 is less than 1, whereas the hazard ratio at time 2 is greater than 1, i.e., the hazards have crossed. 
We conclude that the hazard ratio at time 2 is a biased estimate of the direct effect of treatment on mortality at time 
2. The bias is selection bias arising from conditioning on a common effect Y1 of treatment A and of U, which is a cause 
of Y2 that opens the associational path A → Y1 ← U → Y2 between A and Y2. In the survival analysis literature, an 
unmeasured cause of death that is marginally unassociated with treatment such as U is often referred to as a frailty. 

In contrast, the conditional hazard ratio aRR_{AY2|Y1=0,U} is 1 within each stratum of U because the path 
A → Y1 ← U → Y2 is now blocked by conditioning on the non-collider U. Thus, the conditional hazard ratio correctly 
indicates the absence of a direct effect of A on Y2. That the unconditional hazard ratio aRR_{AY2|Y1=0} differs from the 
stratum-specific hazard ratios aRR_{AY2|Y1=0,U}, even though U is independent of A, shows the noncollapsibility of the 
hazard ratio (Greenland, 1996b). Unfortunately, the unbiased measure aRR_{AY2|Y1=0,U} of the direct effect of A on Y2 
cannot be computed because U is unobserved. In the absence of data on U, it is impossible to know whether A has a 
direct effect on Y2. That is, the data cannot determine whether the true causal DAG generating the data was that in 
Figure 8.8 or in Figure 8.9. All of the above applies to both observational studies and randomized experiments. 
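
The crossing of the hazards described in Technical Point 8.1 can be reproduced with an exact calculation. The sketch below is an illustration only: the structure follows Figure 8.8, but every probability is a hypothetical value we chose for convenience, and the function names are ours.

```python
# Hypothetical inputs for the structure A -> Y1 <- U -> Y2 (no arrow from A to Y2).
P_U = 0.5  # Pr[U = 1]; U is independent of the randomized treatment A


def p_death_1(a, u):
    """Pr[Y1 = 1 | A = a, U = u]: transplant and haplotype both lower early risk."""
    return 0.6 - 0.3 * a - 0.2 * u


def p_death_2_given_alive(u):
    """Pr[Y2 = 1 | Y1 = 0, U = u]: depends on U only, not on A."""
    return 0.5 - 0.3 * u


def hazards(a):
    """Return (hazard at time 1, hazard at time 2, Pr[U = 1 | A = a, Y1 = 0])."""
    h1 = alive = alive_u1 = deaths_2 = 0.0
    for u in (0, 1):
        p_u = P_U if u == 1 else 1 - P_U
        survive = p_u * (1 - p_death_1(a, u))   # Pr[U = u, Y1 = 0 | A = a]
        h1 += p_u * p_death_1(a, u)
        alive += survive
        alive_u1 += survive * u
        deaths_2 += survive * p_death_2_given_alive(u)
    return h1, deaths_2 / alive, alive_u1 / alive


h1_treated, h2_treated, u_treated = hazards(1)
h1_untreated, h2_untreated, u_untreated = hazards(0)

hr1 = h1_treated / h1_untreated                 # hazard ratio at time 1
hr2 = h2_treated / h2_untreated                 # hazard ratio at time 2 (given Y1 = 0)
# Risk ratio for death by time 2 (total mortality through time 2).
rr_total = (h1_treated + (1 - h1_treated) * h2_treated) / (
    h1_untreated + (1 - h1_untreated) * h2_untreated
)
print(round(hr1, 3), round(hr2, 3), round(rr_total, 3))
```

Under these assumed values the hazard ratio is below 1 at time 1 and above 1 at time 2, because treated survivors are less likely to carry U. Within each level of U the hazard ratio at time 2 is exactly 1 by construction, since `p_death_2_given_alive` does not depend on treatment.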





of heart disease Y as represented in Figure 8.7. For simplicity, we assume 
that, unknown to the investigators, A does not cause Y. Parental socioeconomic 
status L affects the risk of becoming a firefighter C and, through 
childhood diet, of heart disease Y. Attraction toward activities that involve 
physical activity (an unmeasured variable U) affects the risk of becoming a 
firefighter and of being physically active (A). U does not affect Y, and L does 
not affect A. According to our terminology, there is no confounding because 
there are no common causes of A and Y. Thus, the associational risk ratio 
Pr[Y = 1|A = 1]/Pr[Y = 1|A = 0] is expected to equal the causal risk ratio 
Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1]. 

Figure 8.8 

Figure 8.9 

However, in a study restricted to firefighters (C = 0), the associational 
and causal risk ratios would differ because conditioning on a common effect C 
of causes of treatment and outcome induces selection bias resulting in lack of 
exchangeability of the treated and untreated firefighters. To the study investigators, 
the distinction between confounding and selection bias is moot because, 
regardless of nomenclature, they must adjust for L to make the treated and 
the untreated firefighters comparable. This example demonstrates that a structural 
classification of bias does not always have consequences for the analysis 


The choice of terminology usually has no practical consequences, but 
disregard for the causal structure may lead to apparent paradoxes. 
For example, the so-called Simpson's paradox (1951) was the result 
of ignoring the difference between common causes and common 
effects. Interestingly, Blyth (1972) failed to grasp the causal structure 
of the paradox in Simpson’s example and misrepresented it as an extreme 
case of confounding. Because most people read Blyth’s paper 
but not Simpson's paper, the misunderstanding was perpetuated. 
See Hernán, Clayton, and Keiding (2011) for details. 
of a study. Indeed, for this reason, many epidemiologists use the term “con- 
founder” for any variable L that needs to be adjusted for, regardless of whether 
the lack of exchangeability is the result of conditioning on a common effect or 
the result of a common cause of treatment and outcome. 
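
The firefighter example can be made concrete with an exact calculation. The sketch below is an illustration only: it encodes the structure of Figure 8.7 as described in the text (U affects A and C, L affects C and Y, and A has no effect on Y), while every probability, and the coding of L = 1 as the level of parental socioeconomic status that raises both probabilities, is a hypothetical choice of ours.

```python
from itertools import product

# Hypothetical inputs; U and L are independent, each with prevalence 0.5.


def p_active(u):
    """Pr[A = 1 | U = u]: attraction to physical activity raises activity."""
    return 0.3 + 0.4 * u


def p_firefighter(u, l):
    """Pr[C = 0 | U = u, L = l]: both U and L affect becoming a firefighter."""
    return 0.1 + 0.3 * u + 0.3 * l


def p_disease(l):
    """Pr[Y = 1 | L = l]: heart disease depends on L only, not on A or U."""
    return 0.1 + 0.2 * l


def risk(a, firefighters_only, l_stratum=None):
    """Pr[Y = 1 | A = a], optionally given C = 0 and optionally given L."""
    num = den = 0.0
    for u, l in product((0, 1), repeat=2):
        if l_stratum is not None and l != l_stratum:
            continue
        w = 0.25 * (p_active(u) if a == 1 else 1 - p_active(u))
        if firefighters_only:
            w *= p_firefighter(u, l)
        num += w * p_disease(l)
        den += w
    return num / den


rr_population = risk(1, False) / risk(0, False)
rr_firefighters = risk(1, True) / risk(0, True)
rr_by_l = [risk(1, True, l) / risk(0, True, l) for l in (0, 1)]
print(round(rr_population, 3), round(rr_firefighters, 3),
      [round(r, 3) for r in rr_by_l])
```

With these assumed values the risk ratio is 1 in the whole population, differs from 1 among firefighters, and returns to 1 within levels of L among firefighters, which is why the investigators must adjust for L whatever name they give to the bias.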

There are, however, advantages of adopting a structural approach to the 
classification of sources of non-exchangeability. First, the structure of the 
problem frequently guides the choice of analytical methods to reduce or avoid 
the bias. For example, in longitudinal studies with time-varying treatments, 
identifying the structure allows us to detect situations in which adjustment 
for confounding via stratification would introduce selection bias (see Part III). 
In those cases, g-methods are a better alternative. Second, even when under- 
standing the structure of bias does not have implications for data analysis (like 
in the firefighters’ study), it could still help study design. For example, inves- 
tigators running a study restricted to firefighters should make sure that they 
collect information on joint risk factors for the outcome Y and for the selection 
variable C (i.e., becoming a firefighter), as described in the first example of 
confounding in Section 7.1. Third, selection bias resulting from conditioning 
on pre-treatment variables (e.g., being a firefighter) could explain why cer- 
tain variables behave as “confounders” in some studies but not others. In our 
example, parental socioeconomic status L would not necessarily need to be 
adjusted for in studies not restricted to firefighters. Finally, causal diagrams 
enhance communication among investigators and may decrease the occurrence 
of misunderstandings. 

As an example of the last point, consider the “healthy worker bias”. We 
described this bias in the previous section as an example of a bias that arises 
from conditioning on the variable C, which is a common effect of (a cause of) 
treatment and (a cause of) the outcome. Thus the bias can be represented 
by the causal diagrams in Figures 8.3-8.6. However, the term “healthy worker 
bias” is also used to describe the bias that occurs when comparing the risk in 
a certain group of workers with that in a group of individuals from the general 
population. 

This second bias can be depicted by the causal diagram in Figure 7.1 in 
which L represents health status, A represents membership in the group of 
workers, and Y represents the outcome of interest. There are arrows from L to 
A and Y because being healthy affects job type and risk of subsequent outcome, 
respectively. In this case, the bias is caused by the common cause L and we 
would refer to it as confounding. The use of causal diagrams to represent the 
structure of the “healthy worker bias” prevents any confusion that may arise 
from employing the same term for different sources of non-exchangeability. 

All the above considerations ignore the magnitude or direction of selec- 
tion bias and confounding. However, it is possible that some noncausal paths 
opened by conditioning on a collider are weak and thus induce little bias. Be- 
cause selection bias is not an “all or nothing” issue, in practice, it is important 
to consider the expected direction and magnitude of the bias (see Fine Point 
8.2). 


8.4 Selection bias and censoring 


Suppose an investigator conducted a marginally randomized experiment to 
estimate the average causal effect of wasabi intake on the one-year risk of 
death (Y = 1). Half of the 60 study participants were randomly assigned to 
For example, we may want to compute the causal risk ratio 
E[Y^{a=1,c=0}]/E[Y^{a=0,c=0}] or the causal risk difference 
E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]. 



eating meals supplemented with wasabi (A = 1) until the end of follow-up or 
death, whichever occurred first. The other half were assigned to meals that 
contained no wasabi (A = 0). After 1 year, 17 individuals died in each group. 
That is, the associational risk ratio Pr [Y = 1|A = 1] / Pr [Y = 1|A = 0] was 1. 
Because of randomization, the causal risk ratio Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1] 
is also expected to be 1. (If ignoring random variability bothers you, please 
imagine the study had 60 million patients rather than 60.) 

Unfortunately, the investigator could not observe the 17 deaths that oc- 
curred in each group because many patients were lost to follow-up, or censored, 
before the end of the study (i.e., death or one year after treatment assignment). 
The proportion of censoring (C = 1) was higher among patients with heart dis- 
ease (L = 1) at the start of the study and among those assigned to wasabi sup- 
plementation (A = 1). In fact, only 9 individuals in the wasabi group and 22 
individuals in the other group were not lost to follow-up. The investigator ob- 
served 4 deaths in the wasabi group and 11 deaths in the other group. That is, 
the associational risk ratio Pr[Y = 1|A = 1,C =0]/Pr[Y = 1|A = 0,C = 0] 
was (4/9)/(11/22) = 0.89 among the uncensored. The risk ratio of 0.89 in 
the uncensored differs from the causal risk ratio of 1 in the entire population: 
There is selection bias due to conditioning on the common effect C. 

The causal diagram in Figure 8.3 depicts the relation between the variables 
L, A, C, and Y in the randomized trial of wasabi. U represents atherosclerosis, 
an unmeasured variable, that affects both heart disease L and death Y. Figure 
8.3 shows that there are no common causes of A and Y, as expected in a 
marginally randomized experiment, and thus there is no need to adjust for 
confounding to compute the causal effect of A on Y. On the other hand, 
Figure 8.3 shows that there is a common cause U of C and Y. The presence 
of this backdoor path C ← L ← U → Y implies that, were the investigator 
interested in estimating the causal effect of censoring C on Y (which is null in 
Figure 8.3), she would have to adjust for confounding due to the common cause 
U. The backdoor criterion says that such adjustment is possible because the 
measured variable L can be used to block the backdoor path C ← L ← U → Y. 

The causal contrast we have considered so far is “the risk if everybody 
had been treated”, Pr[Y^{a=1} = 1], versus “the risk if everybody had remained 
untreated”, Pr[Y^{a=0} = 1], and this causal contrast does not involve C at all. 
Why then are we talking about confounding for the causal effect of C? It turns 
out that the causal contrast of interest needs to be modified in the presence 
of censoring or, in general, of selection. Because selection bias would not exist 
if everybody had been uncensored C = 0, we would like to consider a causal 
contrast that reflects what would have happened in the absence of censoring. 

Let Y^{a=1,c=0} be an individual’s counterfactual outcome if he had received 
treatment A = 1 and he had remained uncensored C = 0. Similarly, let 
Y^{a=0,c=0} be an individual’s counterfactual outcome if he had not received 
treatment A = 0 and he had remained uncensored C = 0. Our causal contrast 
of interest is now “the risk if everybody had been treated and had remained 
uncensored”, Pr[Y^{a=1,c=0} = 1], versus “the risk if everybody had remained 
untreated and uncensored”, Pr[Y^{a=0,c=0} = 1]. 

Often it is reasonable to assume that censoring does not have a causal 
effect on the outcome (an exception would be a setting in which being lost to 
follow-up prevents people from getting additional treatment). Because of the 
lack of effect of censoring C on the outcome Y, one might imagine that the 
definition of causal effect could ignore censoring, i.e., that we could omit the 
superscript c = 0. However, omitting the superscript would obscure the fact 
that considerations about confounding for C become central when computing 
In causal diagrams with no arrow from censoring C to the observed 
outcome Y, we could replace Y by the counterfactual outcome Y^{c=0} 
and add arrows Y^{c=0} → Y and C → Y. 
the causal effect of A on Y in the presence of selection bias. In fact, when 
conceptualizing the causal contrast of interest in terms of Y^{a,c=0}, we can think 
of censoring C as just another treatment. That is, the goal of the analysis is 
to compute the causal effect of a joint intervention on A and C. To eliminate 
selection bias for the effect of treatment A, we need to adjust for confounding 
for the effect of treatment C. 

Since censoring C is now viewed as a treatment, it follows that we will need 
to (i) ensure that the identifiability conditions of exchangeability, positivity, 
and consistency hold for C as well as for A, and (ii) use analytical methods 
that are identical to those we would have to use if we wanted to estimate the 
effect of censoring C. Under these identifiability conditions and using these 
methods, selection bias can be eliminated via analytic adjustment and, in the 
absence of measurement error and confounding, the causal effect of treatment 
A on outcome Y can be identified. The next section explains how to do so. 


8.5 How to adjust for selection bias 


We have described IP weights to adjust for confounding, W^A = 1/f(A|L), 
and selection bias, W^C = 1/Pr[C = 0|A, L]. When both confounding and 
selection bias exist, the product weight W^A W^C can be used to adjust 
simultaneously for both biases under assumptions described in Chapter 12 and 
Part III. 


Though selection bias can sometimes be avoided by an adequate design (see 
Fine Point 8.1), it is often unavoidable. For example, loss to follow up, self- 
selection, and, in general, missing data leading to bias can occur no matter how 
careful the investigator. In those cases, the selection bias needs to be explicitly 
corrected in the analysis. This correction can sometimes be accomplished by 
IP weighting (or by standardization), which is based on assigning a weight W^C 
to each selected individual (C = 0) so that she accounts in the analysis not 
only for herself, but also for those like her, i.e., with the same values of L and 
A, who were not selected (C = 1). The IP weight W^C is the inverse of the 
probability of her selection Pr[C = 0|L, A]. 

To describe the application of IP weighting for selection bias adjustment 
consider again the wasabi randomized trial described in the previous section. 
The tree graph in Figure 8.10 presents the trial data. Of the 60 individuals in 
the trial, 40 had (L = 1) and 20 did not have (L = 0) heart disease at the time 
of randomization. Regardless of their L status, all individuals had a 50/50 
chance of being assigned to wasabi supplementation (A = 1). Thus 10 individ- 
uals in the L = 0 group and 20 in the L = 1 group received treatment A = 1. 
This lack of effect of L on A is represented by the lack of an arrow from L to A 
in the causal diagram of Figure 8.3. The probability of remaining uncensored 
varies across branches in the tree. For example, 50% of the individuals without 
heart disease who were assigned to wasabi (L = 0, A = 1) remained uncensored, 
whereas 60% of the individuals with heart disease who were assigned to no 
wasabi (L = 1, A = 0) remained uncensored. This effect of A and L on C is 
represented by arrows from A and L into C in the causal diagram of Figure 8.3. Finally, 
the tree shows how many people would have died (Y = 1) both among the 
uncensored and the censored individuals. Of course, in real life, investigators 
would never know how many deaths occurred among the censored individuals. 
It is precisely the lack of this knowledge which forces investigators to restrict 
the analysis to the uncensored, opening the door for selection bias. Here we 
show the deaths in the censored to document that, as depicted in Figure 8.3, 
treatment A is marginally independent of Y, and censoring C is independent 
of Y within levels of L. It can also be checked that the risk ratio in the entire 
population (inaccessible to the investigator) is 1 whereas the risk ratio in the 
uncensored (accessible to the investigator) is 0.89. 
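
The checks mentioned above can be carried out on a tabulation of the tree. The sketch below is an illustration under a stated assumption: branch sizes follow the percentages given in the text, but the number of deaths in each branch is reconstructed by us from the totals reported in the text (17 deaths per arm, 4 and 11 observed deaths) together with the stated independencies, because Figure 8.10 itself is not reproduced here.

```python
# (L, A, C) -> (number of individuals, number of deaths Y = 1).
# Death counts per branch are a reconstruction (assumption), see the lead-in.
TREE = {
    (0, 0, 0): (10, 2),
    (0, 0, 1): (0, 0),
    (0, 1, 0): (5, 1),
    (0, 1, 1): (5, 1),
    (1, 0, 0): (12, 9),
    (1, 0, 1): (8, 6),
    (1, 1, 0): (4, 3),
    (1, 1, 1): (16, 12),
}


def risk(a=None, c=None, l=None):
    """Pr[Y = 1] among individuals matching the given values (None = any)."""
    n = deaths = 0
    for (li, ai, ci), (count, dead) in TREE.items():
        if (a is None or ai == a) and (c is None or ci == c) and (l is None or li == l):
            n += count
            deaths += dead
    return deaths / n


rr_population = risk(a=1) / risk(a=0)            # inaccessible to the investigator
rr_uncensored = risk(a=1, c=0) / risk(a=0, c=0)  # what the investigator can compute
print(round(rr_population, 2), round(rr_uncensored, 2))

# Censoring is independent of Y within levels of L.
for l in (0, 1):
    print(l, risk(c=0, l=l), risk(c=1, l=l))
```

The tabulation reproduces the two risk ratios quoted in the text, 1 in the entire population and 0.89 among the uncensored, and shows equal risks in the censored and the uncensored within each level of L.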
Let us now describe the intuition behind the use of IP weighting to adjust 
for selection bias. Look at the bottom of the tree in Figure 8.10. There 
are 20 individuals with heart disease (L = 1) who were assigned to wasabi 
supplementation (A = 1). Of these, 4 remained uncensored and 16 were lost 
to follow-up. That is, the conditional probability of remaining uncensored in 
this group is 1/5, i.e., Pr[C = 0|L = 1, A = 1] = 4/20 = 0.2. In an IP weighted 
analysis the 16 censored individuals receive a zero weight (i.e., they do not 
contribute to the analysis), whereas the 4 uncensored individuals receive a 
weight of 5, which is the inverse of their probability of being uncensored (1/5). 
IP weighting replaces the 20 original individuals by 5 copies of each of the 
4 uncensored individuals. The same procedure can be repeated for the other 
branches of the tree, as shown in Figure 8.11, to construct a pseudo-population 
of the same size as the original study population but in which nobody is lost to 
follow-up. (We let the reader derive the IP weights for each branch of the tree.) 
The associational risk ratio in the pseudo-population is 1, the same as the risk 
ratio Pr[Y^{a=1,c=0} = 1]/Pr[Y^{a=0,c=0} = 1] that would have been computed in 
the original population if nobody had been censored. 
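
The construction of the pseudo-population can be sketched in a few lines. This is an illustration only: the branch sizes and numbers uncensored come from the text, whereas the deaths among the uncensored in each branch are our reconstruction from the reported totals (4 observed deaths under wasabi, 11 otherwise) and the stated independencies, since Figure 8.10 is not reproduced here. The function names are ours.

```python
from fractions import Fraction

# (L, A) -> (n in branch, n uncensored, deaths among the uncensored).
# Deaths per branch are a reconstruction (assumption), see the lead-in.
UNCENSORED = {
    (0, 0): (10, 10, 2),
    (0, 1): (10, 5, 1),
    (1, 0): (20, 12, 9),
    (1, 1): (20, 4, 3),
}


def ip_weight(l, a):
    """W^C = 1 / Pr[C = 0 | L = l, A = a], computed from the branch counts."""
    n, n_uncensored, _ = UNCENSORED[(l, a)]
    return Fraction(n, n_uncensored)


def pseudo_population_risk(a):
    """Pr[Y = 1 | A = a] after each uncensored individual is weighted by W^C."""
    size = deaths = Fraction(0)
    for (l, ai), (_, n_uncensored, dead) in UNCENSORED.items():
        if ai != a:
            continue
        w = ip_weight(l, ai)
        size += w * n_uncensored   # weighted copies stand in for the whole branch
        deaths += w * dead
    return deaths / size


weights = {branch: ip_weight(*branch) for branch in UNCENSORED}
risk_ratio = pseudo_population_risk(1) / pseudo_population_risk(0)
print(weights)
print(pseudo_population_risk(1), pseudo_population_risk(0), risk_ratio)
```

Exact fractions are used so that the weight of 5 for the L = 1, A = 1 branch and the pseudo-population risk ratio of 1 are reproduced without rounding error.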
The association measure in the pseudo-population equals the effect measure 
in the original population if the following three identifiability conditions are 
met. 

First, the average outcome in the uncensored individuals must equal the 
unobserved average outcome in the censored individuals with the same val- 
ues of A and L. This provision will be satisfied if the probability of selection 
Pr[C = 0|L = l, A = a] is calculated conditional on treatment A and on all 
additional factors that independently predict both selection and the outcome, 
that is, if the variables in A and L are sufficient to block all backdoor paths 
between C and Y. Unfortunately, one can never be sure that these additional 
factors were identified and recorded in L, and thus the causal interpretation 
of the resulting adjustment for selection bias depends on this untestable 
exchangeability assumption. 

Second, IP weighting requires that all conditional probabilities of being 
uncensored given the variables in L must be greater than zero. Note this 
positivity condition is required for the probability of being uncensored (C = 0) 
but not for the probability of being censored (C = 1) because we are not 
interested in inferring what would have happened if study individuals had 
been censored, and thus there is no point in constructing a pseudo-population 
in which everybody is censored. For example, the tree in Figure 8.10 shows 
that Pr[C = 1|L = 0, A = 0] = 0, but this zero does not affect our ability to 
construct a pseudo-population in which nobody is censored. 

A competing event is an event that prevents the outcome of interest from 
happening. A typical example of a competing event is death because, once an 
individual dies, no other outcomes can occur. 

Figure 8.12 

Figure 8.13 

The third condition is consistency, including well-defined interventions. IP 
weighting is used to create a pseudo-population in which censoring C has been 
abolished, and in which the effect of the treatment A is the same as in the 
original population. Thus, the pseudo-population effect measure is equal to 
the effect measure had nobody been censored. This effect measure may be 
relatively well defined when censoring is the result of loss to follow-up or 
nonresponse, but not when censoring is defined as the occurrence of a competing 
event. For example, in a study aimed at estimating the effect of a certain 
treatment on the risk of Alzheimer’s disease, death from other causes (cancer, heart 
disease, and so on) is a competing event. Defining death as a form of censoring 
is problematic: we might not wish to base our effect estimates on a pseudo- 
population in which all other causes of death have been removed, because it 
is unclear even conceptually what sort of intervention would produce such a 
population. Also, no feasible intervention could possibly remove just one cause 
of death without affecting the others as well. 

Finally, one could argue that IP weighting is not necessary to adjust for 
selection bias in a setting like that described in Figure 8.3. Rather, one might 
attempt to remove selection bias by stratification (i.e., by estimating the ef- 
fect measure conditional on the L variables) rather than by IP weighting. 
Stratification could yield unbiased conditional effect measures within levels of 
L because conditioning on L is sufficient to block the backdoor path from C 
to Y. That is, the conditional risk ratio 


Pr[Y = 1|A = 1, C = 0, L = l] / Pr[Y = 1|A = 0, C = 0, L = l] 

can be interpreted as the effect of treatment among the uncensored with L = l. 
For the same reason, under the null, stratification would work (i.e., it would 
provide an unbiased conditional effect measure) if the data can be represented 
by the causal structure in Figure 8.5. Stratification, however, would not work 
under the structure depicted in Figures 8.4 and 8.6. Take Figure 8.4. Condi- 
tioning on L blocks the backdoor path from C to Y but also opens the path 
A → L ← U → Y from A to Y because L is a collider on that path. Thus, 
even if the causal effect of A on Y is null, the conditional (on L) risk ratio 
would be generally different from 1. And similarly for Figure 8.6. In contrast, 
IP weighting appropriately adjusts for selection bias under Figures 8.3-8.6 be- 
cause this approach is not based on estimating effect measures conditional on 
the covariates L, but rather on estimating unconditional effect measures after 
reweighting the individuals according to their treatment and their values of L. 
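
The contrast between stratification and IP weighting under a structure like that of Figure 8.4 can be verified by exact enumeration. The sketch below assumes the arrows A → L ← U → Y and L → C with no effect of A on Y; all probabilities and function names are hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical parameters; the null holds because Y depends on U only.
P_U = 0.5  # Pr[U = 1], U is unmeasured
P_A = 0.5  # Pr[A = 1]

def bern(p, x):
    """Probability that a Bernoulli(p) variable takes the value x."""
    return p if x == 1 else 1.0 - p

def p_l(a, u):
    return 0.1 + 0.4 * a + 0.4 * u   # Pr[L = 1 | A = a, U = u]

def p_c(l):
    return 0.6 if l == 1 else 0.2    # Pr[C = 1 | L = l]

def p_y(u):
    return 0.1 + 0.5 * u             # Pr[Y = 1 | U = u]

def joint(u, a, l, c, y):
    """Joint probability of one configuration of (U, A, L, C, Y)."""
    return (bern(P_U, u) * bern(P_A, a) * bern(p_l(a, u), l)
            * bern(p_c(l), c) * bern(p_y(u), y))

def stratified_rr(l):
    """Risk ratio among the uncensored within the stratum L = l."""
    def risk(a):
        cases = sum(joint(u, a, l, 0, 1) for u in (0, 1))
        total = sum(joint(u, a, l, 0, y) for u, y in product((0, 1), repeat=2))
        return cases / total
    return risk(1) / risk(0)

def ip_weighted_rr():
    """Unconditional risk ratio among the uncensored after IP weighting."""
    def risk(a):
        cases = total = 0.0
        for u, l, y in product((0, 1), repeat=3):
            # 1 / Pr[C = 0 | A, L]; in this sketch censoring depends on L only.
            weight = 1.0 / (1.0 - p_c(l))
            mass = joint(u, a, l, 0, y) * weight
            total += mass
            cases += mass * y
        return cases / total
    return risk(1) / risk(0)

rr_l0, rr_l1, rr_ipw = stratified_rr(0), stratified_rr(1), ip_weighted_rr()
```

With these numbers both conditional risk ratios differ from 1 even though treatment has no effect, while the IP weighted risk ratio equals 1 up to floating-point error.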

This is the first time we discuss a situation in which stratification cannot 
be used to validly compute the causal effect of treatment, even if the three 
conditions of exchangeability, positivity, and consistency hold. We will discuss 
other situations with a similar structure in Part III when considering the effect 
of time-varying treatments. 


The causal diagram in Figure 8.12 represents a hypothetical study with 
dichotomous variables surgery A, a certain genetic haplotype E, and death Y. 
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Technical Point 8.2 


Multiplicative survival model. When the conditional probability of survival Pr[Y = 0|E = e, A = a] given A and E is 
equal to a product g(e)h(a) of functions of e and a, we say that a multiplicative survival model holds. A multiplicative 
survival model 

Pr[Y = 0|E = e, A = a] = g(e)h(a) 

is equivalent to a model that assumes the survival ratio Pr[Y = 0|E = e, A = a] / Pr[Y = 0|E = e, A = 0] does not 
depend on e and is equal to h(a). The data follow a multiplicative survival model when there is no interaction 
between A and E on the multiplicative scale as depicted in Figure 8.13. If Pr[Y = 0|E = e, A = a] = g(e)h(a), then 
Pr[Y = 1|E = e, A = a] = 1 − g(e)h(a) does not follow a multiplicative mortality model. Hence, when A and E are 
conditionally independent given Y = 0, they will be conditionally dependent given Y = 1. 
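
The conditional independence given Y = 0 can be checked with a short derivation, sketched here under the additional assumption (as in Figure 8.12) that A and E are marginally independent. By Bayes' rule:

```latex
\begin{aligned}
\Pr[A=a, E=e \mid Y=0]
  &= \frac{\Pr[Y=0 \mid E=e, A=a]\,\Pr[A=a]\,\Pr[E=e]}{\Pr[Y=0]} \\
  &= \frac{\bigl(h(a)\,\Pr[A=a]\bigr)\,\bigl(g(e)\,\Pr[E=e]\bigr)}{\Pr[Y=0]} .
\end{aligned}
```

The right-hand side is a product of a function of a alone and a function of e alone, so A and E are independent given Y = 0. Replacing g(e)h(a) by 1 − g(e)h(a) gives the corresponding expression for Y = 1, which does not factor in this way unless g or h is constant.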

















According to the rules of d-separation, surgery A and haplotype E are (i) mar- 
ginally independent, i.e., the probability of receiving surgery is the same for 
people with and without the genetic haplotype, and (ii) associated condition- 
ally on Y, i.e., the probability of receiving surgery varies by haplotype when 
the study is restricted to, say, the survivors (Y = 0). 

Indeed conditioning on the common effect Y of two independent causes A 
and E always induces a conditional association between A and E in at least 
one of the strata of Y (say, Y = 1). However, there is a special situation under 
which A and E remain conditionally independent within the other stratum 
(say, Y = 0). 

Suppose A and E affect survival through totally independent mechanisms 
in such a way that E cannot possibly modify the effect of A on Y, and vice 
versa. For example, suppose that the surgery A affects survival through the 
removal of a tumor, whereas the haplotype E affects survival through increasing 
levels of low-density lipoprotein cholesterol resulting in an increased risk 
of heart attack (whether or not a tumor is present). In this scenario, we can 
consider 3 cause-specific mortality variables: death from tumor Y_A, death from 
heart attack Y_E, and death from any other causes Y_O. The observed mortality 
variable Y is equal to 1 (death) when Y_A or Y_E or Y_O is equal to 1, and Y is 
equal to 0 (survival) when Y_A and Y_E and Y_O equal 0. The causal diagram in 
Figure 8.13, an expansion of that in Figure 8.12, represents a causal structure 
linking all these variables. We assume data on underlying cause of death (Y_A, 
Y_E, Y_O) are not recorded and thus the only measured variables are those in 
Figure 8.12 (A, E, Y). 

Because the arrows from Y_A, Y_E and Y_O to Y are deterministic, conditioning 
on observed survival (Y = 0) is equivalent to simultaneously conditioning 
on Y_A = 0, Y_E = 0, and Y_O = 0 as well, i.e., conditioning on Y = 0 implies 
Y_A = Y_E = Y_O = 0. As a consequence, we find by applying d-separation to 
Figure 8.13 that A and E are conditionally independent given Y = 0, i.e., 
when conditioning on collider Y = 0, the path between A and E through 
Y is blocked by conditioning on the non-colliders Y_A, Y_E and Y_O. On the 
other hand, conditioning on death Y = 1 does not imply conditioning on any 
specific values of Y_A, Y_E and Y_O as the event Y = 1 is compatible with 7 
possible unmeasured events: (Y_A = 1, Y_E = 0, Y_O = 0), (Y_A = 0, Y_E = 1, Y_O = 0), 
(Y_A = 0, Y_E = 0, Y_O = 1), (Y_A = 1, Y_E = 1, Y_O = 0), (Y_A = 0, Y_E = 1, Y_O = 1), 
(Y_A = 1, Y_E = 0, Y_O = 1), and (Y_A = 1, Y_E = 1, Y_O = 1). Thus, A and E are 
associated given Y = 1, i.e., when conditioning on collider Y = 1, the path 
between A and E through Y is not blocked. 

Figure 8.14 

Figure 8.15 

Figure 8.16 
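
The cause-specific argument for Figure 8.13 can be checked numerically. In the sketch below the three cause-specific deaths are generated independently, one depending only on surgery, one only on the haplotype, and one on neither; the probabilities and names are hypothetical.

```python
# Hypothetical marginal probabilities of surgery and haplotype (independent causes).
P_A, P_E = 0.5, 0.3
P_YO = 0.10  # Pr[Y_O = 1], death from other causes

def p_ya(a):
    return 0.10 if a == 1 else 0.30   # Pr[Y_A = 1 | A = a], death from tumor

def p_ye(e):
    return 0.25 if e == 1 else 0.05   # Pr[Y_E = 1 | E = e], death from heart attack

def p_survive(a, e):
    # Y = 0 requires Y_A = Y_E = Y_O = 0; the three causes act independently.
    return (1 - p_ya(a)) * (1 - p_ye(e)) * (1 - P_YO)

def p_cell(a, e, y):
    """Joint probability Pr[A = a, E = e, Y = y]."""
    prior = (P_A if a == 1 else 1 - P_A) * (P_E if e == 1 else 1 - P_E)
    survival = p_survive(a, e)
    return prior * (survival if y == 0 else 1 - survival)

def conditional_odds_ratio(y):
    """Odds ratio between A and E within the stratum Y = y."""
    return (p_cell(1, 1, y) * p_cell(0, 0, y)) / (p_cell(1, 0, y) * p_cell(0, 1, y))

or_survivors = conditional_odds_ratio(0)
or_deaths = conditional_odds_ratio(1)
```

Among survivors the odds ratio is 1 (up to rounding), because the survival probability is a product of a function of a and a function of e. Among the deaths it differs from 1.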
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Fine Point 8.2 


The strength and direction of selection bias. We have referred to selection bias as an “all or nothing” issue: either 
bias exists or it doesn't. In practice, however, it is important to consider the expected direction and magnitude of the 
bias. 

The direction of the conditional association between 2 marginally independent causes A and E within strata of 
their common effect Y depends on how the two causes A and E interact to cause Y. For example, suppose that, in 
the presence of an undiscovered background factor U that is unassociated with A or E, having either A = 1 or E = 1 
is sufficient and necessary to cause death (an “or” mechanism), but that neither A nor E causes death in the absence 
of U. Then among those who died (Y = 1), A and E will be negatively associated, because an individual with A = 0 
is more likely to have had E = 1: the absence of A increases the chance that E was the cause of death. (Indeed, 
the logarithm of the conditional odds ratio OR_{AE|Y=1} will approach minus infinity as the population prevalence of U 
approaches 1.0.) This “or” mechanism was the only explanation given in the main text for the conditional association 
of independent causes within strata of a common effect; nonetheless, other possibilities exist. 

For example, suppose that in the presence of the undiscovered background factor U, having both A = 1 and E = 1 
is sufficient and necessary to cause death (an “and” mechanism) and that neither A nor E causes death in the absence 
of U. Then, among those who die, those with A = 1 are more likely to have E = 1, i.e., A and E are positively 
correlated. A standard DAG such as that in Figure 8.12 fails to distinguish between the case of A and E interacting 
through an “or” mechanism from the case of an “and” mechanism. Causal DAGs with sufficient causation structures 
(VanderWeele and Robins, 2007c) overcome this shortcoming. 

Regardless of the direction of selection bias, another key issue is its magnitude. Biases that are not large enough 
to affect the conclusions of the study may be safely ignored in practice, whether the bias is upwards or downwards. 
Generally speaking, a large selection bias requires strong associations between the collider and both treatment and 
outcome. Greenland (2003) studied the magnitude of selection bias under the null, which he referred to as collider- 
stratification bias, in several scenarios. 
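
The opposite directions produced by the “or” and “and” mechanisms can be illustrated with a small calculation. This is a rough sketch rather than the model of the text: to keep every odds ratio finite it adds a small background risk of death that is independent of A, E, and U, which is an assumption of the example, as are all names and numbers.

```python
def conditional_or_among_deaths(mechanism, p_a=0.5, p_e=0.5, p_u=0.8, p_background=0.05):
    """Odds ratio between A and E among those with Y = 1.

    mechanism: "or"  -> U together with (A = 1 or E = 1) causes death
               "and" -> U together with (A = 1 and E = 1) causes death
    p_background is a hypothetical extra risk of death from unrelated causes.
    """
    def p_death(a, e):
        triggered = (a or e) if mechanism == "or" else (a and e)
        p_mechanism = p_u if triggered else 0.0
        return 1 - (1 - p_mechanism) * (1 - p_background)

    def cell(a, e):
        # Pr[A = a, E = e, Y = 1], with A and E marginally independent.
        prior = (p_a if a == 1 else 1 - p_a) * (p_e if e == 1 else 1 - p_e)
        return prior * p_death(a, e)

    return (cell(1, 1) * cell(0, 0)) / (cell(1, 0) * cell(0, 1))

or_under_or = conditional_or_among_deaths("or")
or_under_and = conditional_or_among_deaths("and")
```

Under these assumptions the odds ratio among the deaths is below 1 for the “or” mechanism and above 1 for the “and” mechanism.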


In contrast with the situation represented in Figure 8.13, the variables 
A and E will not be independent conditionally on Y = 0 when one of the 
situations represented in Figures 8.14-8.16 occurs. If A and E affect survival 
through a common mechanism, then there will exist an arrow either from A 
to Y_E or from E to Y_A, as shown in Figure 8.14. In that case, A and E 
will be dependent within both strata of Y. Similarly, if Y_A and Y_E are not 
independent because of a common cause V as shown in Figure 8.15, A and E 
will be dependent within both strata of Y. Finally, if the causes Y_A and Y_O, 
and Y_E and Y_O, are not independent because of common causes W1 and W2 as 
shown in Figure 8.16, then A and E will also be dependent within both strata 
of Y. When the data can be summarized by Figure 8.13, we say that the data 
follow a multiplicative survival model (see Technical Point 8.2). 
What is interesting about Figure 8.13 is that by adding the unmeasured 
variables Y_A, Y_E and Y_O, which functionally determine the observed variable 
Y, we have created an augmented causal diagram that succeeds in representing 
both the conditional independence between A and E given Y = 0 and their 
conditional dependence given Y = 1. 

Augmented causal DAGs, introduced by Hernán, Hernández-Díaz, and Robins 
(2004), can be extended to represent the sufficient causes described in 
Chapter 5 (VanderWeele and Robins, 2007c). 

In summary, conditioning on a collider always induces an association 
between its causes, but this association could be restricted to certain levels of the 
common effect. In other words, it is theoretically possible that selection on a 
common effect does not result in selection bias when the analysis is restricted 
to a single level of the common effect. Collider stratification is not always a 
source of selection bias. 


Chapter 9 
MEASUREMENT BIAS 


Suppose an investigator conducted a randomized experiment to answer the causal question “does one’s looking 
up to the sky make other pedestrians look up too?” She found a weak association between her looking up and 
other pedestrians’ looking up. Does this weak association reflect a weak causal effect? By definition of randomized 
experiment, confounding bias is not expected in this study. In addition, no selection bias was expected because 
all pedestrians’ responses—whether they did or did not look up—were recorded. However, there was another 
problem: the investigator’s collaborator who was in charge of recording the pedestrians’ responses made many 
mistakes. Specifically, the collaborator missed half of the instances in which a pedestrian looked up and recorded 
these responses as “did not look up.” Thus, even if the treatment (the investigator’s looking up) truly had a strong 
effect on the outcome (other people’s looking up), the misclassification of the outcome will result in a dilution of 
the association between treatment and the (mismeasured) outcome. 

We say that there is measurement bias when the association between treatment and outcome is weakened 
or strengthened as a result of the process by which the study data are measured. Since measurement errors can 
occur under any study design—including randomized experiments and observational studies—measurement bias 
need always be considered when interpreting effect estimates. This chapter provides a description of biases due to 
measurement error. 


9.1 Measurement error 


In previous chapters we implicitly made the unrealistic assumption that all 
variables were perfectly measured. Consider an observational study designed to 
estimate the effect of a cholesterol-lowering drug A on the risk of liver disease Y. 
We often expect that treatment A will be measured imperfectly. For example, 
if the information on drug use is obtained by medical record abstraction, the 
abstractor may make a mistake when transcribing the data, the physician may 
forget to write down that the patient was prescribed the drug, or the patient 
may not take the prescribed treatment. Thus, the treatment variable in our 
analysis data set will not be the true use of the drug, but rather the measured 
use of the drug. We will refer to the measured treatment as A* (read A-star), 
which will not necessarily equal the true treatment A for a given individual. 
The psychological literature sometimes refers to A as the “construct” and to 
A* as the “measure” or “indicator.” The challenge in observational disciplines 
is making inferences about the unobserved construct (e.g., cholesterol-lowering 
drug use) by using data on the observed measure (e.g., information on statin 
use from medical records). 

Figure 9.1 

The causal diagram in Figure 9.1 depicts the variables A, A*, and Y. For 
simplicity, we chose a setting with neither confounding nor selection bias for 
the causal effect of A on Y. The true treatment A affects both the outcome Y 
and the measured treatment A*. The causal diagram also includes the node U_A 
to represent all factors other than A that determine the value of A*. We refer 
to U_A as the measurement error for A. Note that the node U_A is unnecessary 
in discussions of confounding (it is not part of a backdoor path) or selection 
bias (no variables are conditioned on) and therefore we omitted it from the 
causal diagrams in Chapters 7 and 8. For the same reasons, the determinants 
of the variables A and Y are not included in Figure 9.1. 

Measurement error for discrete variables is known as misclassification. 

Technical Point 9.1 


Independence and nondifferentiality. Let f(·) denote a probability density function (PDF). The measurement errors 
U_A for treatment and U_Y for outcome are independent if their joint PDF equals the product of their marginal PDFs, i.e., 
f(U_Y, U_A) = f(U_Y)f(U_A). The measurement error U_A for the treatment is nondifferential if its PDF is independent of 
the outcome Y, i.e., f(U_A|Y) = f(U_A). Analogously, the measurement error U_Y for the outcome is nondifferential if 
its PDF is independent of the treatment A, i.e., f(U_Y|A) = f(U_Y). 

Figure 9.2 

Besides treatment A, the outcome Y can be measured with error too. The 
causal diagram in Figure 9.2 includes the measured outcome Y*, and the mea- 
surement error Uy for Y. Figure 9.2 illustrates a common situation in practice. 
One wants to compute the average causal effect of the treatment A on the out- 
come Y, but these variables A and Y have not been, or cannot be, measured 
correctly. Rather, only the mismeasured versions A* and Y* are available to 
the investigator who aims at identifying the causal effect of A on Y. 

Figure 9.2 also represents a setting in which there is neither confounding nor 
selection bias for the causal effect of treatment A on outcome Y. According to 
our reasoning in previous chapters, association is causation in this setting. We 
can compute any association measure and endow it with a causal interpretation. 
For example, the associational risk ratio Pr[Y = 1|A = 1] / Pr[Y = 1|A = 0] is 
equal to the causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1]. Our implicit 
assumption in previous chapters, which we now make explicit, was that perfectly 
measured data on A and Y were available. 

We now consider the more realistic setting in which treatment and outcome 
are measured with error. Then there is no guarantee that the measure of 
association between A* and Y* will equal the measure of causal effect of A 
on Y. The associational risk ratio Pr [Y* = 1|A* = 1] / Pr [Y* = 1|A* = 0] will 
generally differ from the causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1]. We 
say that there is measurement bias or information bias. In the presence of 
measurement bias, the identifiability conditions of exchangeability, positivity, 
and consistency are insufficient to compute the causal effect of treatment A on 
outcome Y. 
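
A small calculation shows how the A*-Y* risk ratio can depart from the A-Y risk ratio. The sketch assumes dichotomous variables and the error structure of Figure 9.2, with misclassification summarized by a sensitivity and a specificity for each of A* and Y*; the function name, parameter names, and numbers are hypothetical.

```python
def measured_risk_ratio(p_a, risk1, risk0, sens_a, spec_a, sens_y, spec_y):
    """Pr[Y* = 1 | A* = 1] / Pr[Y* = 1 | A* = 0] when A* depends on A only
    and Y* depends on Y only.

    p_a: Pr[A = 1]; risk1, risk0: Pr[Y = 1 | A = 1], Pr[Y = 1 | A = 0].
    """
    def p_astar(astar, a):
        p1 = sens_a if a == 1 else 1 - spec_a          # Pr[A* = 1 | A = a]
        return p1 if astar == 1 else 1 - p1

    def p_ystar1(a):
        risk = risk1 if a == 1 else risk0
        return sens_y * risk + (1 - spec_y) * (1 - risk)  # Pr[Y* = 1 | A = a]

    def risk_star(astar):
        cases = total = 0.0
        for a in (0, 1):
            mass = (p_a if a == 1 else 1 - p_a) * p_astar(astar, a)
            total += mass
            cases += mass * p_ystar1(a)
        return cases / total

    return risk_star(1) / risk_star(0)

# True risk ratio of 3 in every call except the last, where the null holds.
rr_perfect = measured_risk_ratio(0.5, 0.3, 0.1, sens_a=1.0, spec_a=1.0, sens_y=1.0, spec_y=1.0)
rr_mismeasured = measured_risk_ratio(0.5, 0.3, 0.1, sens_a=0.8, spec_a=0.9, sens_y=0.7, spec_y=0.95)
rr_null = measured_risk_ratio(0.5, 0.2, 0.2, sens_a=0.8, spec_a=0.9, sens_y=0.7, spec_y=0.95)
```

With perfect measurement the function returns the true risk ratio. With the hypothetical error rates above the measured risk ratio lies between 1 and 3, and when treatment and outcome are unassociated it stays at 1.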


9.2 The structure of measurement error 
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Figure 9.3 


The causal structure of confounding can be summarized as the presence of 
common causes of treatment and outcome, and the causal structure of selec- 
tion bias can be summarized as conditioning on common effects of treatment 
and outcome (or of their causes). Measurement bias arises in the presence of 
measurement error, but there is no single structure to summarize measurement 
error. This section classifies the structure of measurement error according to 
two properties—independence and nondifferentiality—that we describe below 
(see Technical Point 9.1 for formal definitions). 

The causal diagram in Figure 9.2 depicts the measurement errors U_A and 
U_Y for treatment A and outcome Y, respectively. According to the rules 
of d-separation, the measurement errors U_A and U_Y are independent because 
the path between them is blocked by colliders (either A* or Y*). Independent 
errors are expected to arise if, for example, information on both drug use A 
and liver toxicity Y was obtained from electronic medical records in which data 
entry errors occurred haphazardly. In other settings, however, the measurement 
errors for exposure and outcome may be dependent, as depicted in Figure 
9.3. For example, dependent measurement errors will occur if the information 
was obtained retrospectively by phone interview and an individual’s ability to 
recall her medical history (U_AY) affected the measurement of both A and Y. 

Both Figures 9.2 and 9.3 represent settings in which the error for treatment 
U_A is independent of the true value of the outcome Y, and the error for the 
outcome U_Y is independent of the true value of treatment. We then say that the 
measurement error for treatment is nondifferential with respect to the outcome, 
and that the measurement error for the outcome is nondifferential with respect 
to the treatment. The causal diagram in Figure 9.4 shows an example of 
independent but differential measurement error in which the true value of the 
outcome affects the measurement of the treatment (i.e., an arrow from Y to 
U_A). Some examples of differential measurement error of the treatment follow. 

Suppose that the outcome Y were dementia rather than liver toxicity, and 
that drug use A were ascertained by interviewing study participants. Since 
the presence of dementia affects the ability to recall A, one would expect an 
arrow from Y to U_A. Similarly, one would expect an arrow from Y to U_A in a 
study to compute the effect of alcohol use during pregnancy A on birth defects 
Y if alcohol intake is ascertained by recall after delivery—because recall may 
be affected by the outcome of the pregnancy. The resulting measurement bias 
in these two examples is often referred to as recall bias. A bias with the same 
structure might arise if blood levels of drug A* are used in place of actual drug 
use A, and blood levels are measured after liver toxicity Y is present—because 
liver toxicity affects the measured blood levels of the drug. The resulting 
measurement bias is often referred to as reverse causation bias. 
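
Recall bias can be illustrated with an exact calculation under the null. The sketch assumes that treatment and outcome are independent, that the outcome is measured without error, and that the sensitivity of the recalled treatment A* depends on the outcome; names and numbers are hypothetical.

```python
def recall_bias_risk_ratio(p_a, p_y, sens_cases, sens_noncases, spec=1.0):
    """Pr[Y = 1 | A* = 1] / Pr[Y = 1 | A* = 0] when A and Y are independent
    but the sensitivity of A* differs between cases (Y = 1) and noncases (Y = 0)."""
    def p_astar1(a, y):
        sens = sens_cases if y == 1 else sens_noncases
        return sens if a == 1 else 1 - spec            # Pr[A* = 1 | A = a, Y = y]

    def risk_given_astar(astar):
        cases = total = 0.0
        for a in (0, 1):
            for y in (0, 1):
                mass = (p_a if a == 1 else 1 - p_a) * (p_y if y == 1 else 1 - p_y)
                p1 = p_astar1(a, y)
                mass *= p1 if astar == 1 else 1 - p1
                total += mass
                cases += mass * y
        return cases / total

    return risk_given_astar(1) / risk_given_astar(0)

# Cases recall their exposure better than noncases (differential error).
rr_differential = recall_bias_risk_ratio(0.3, 0.1, sens_cases=0.9, sens_noncases=0.6)
# Same recall in cases and noncases (nondifferential error).
rr_nondifferential = recall_bias_risk_ratio(0.3, 0.1, sens_cases=0.6, sens_noncases=0.6)
```

Even though the treatment has no effect, the differential scenario yields a risk ratio above 1, while the nondifferential scenario returns 1.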

The causal diagram in Figure 9.5 shows an example of independent but 
differential measurement error in which the true value of the treatment affects 
the measurement of the outcome (i.e., an arrow from A to U_Y). A differential 
measurement error of the outcome will occur if physicians, suspecting that drug 
use A causes liver toxicity Y, monitored patients receiving the drug more closely 
than other patients. Figures 9.6 and 9.7 depict measurement errors that are 
both dependent and differential, which may result from a combination of the 
settings described above. 

In summary, we have discussed four types of measurement error: indepen- 
dent nondifferential (Figure 9.2), dependent nondifferential (Figure 9.3), inde- 
pendent differential (Figures 9.4 and 9.5), and dependent differential (Figures 
9.6 and 9.7). The particular structure of the measurement error determines 
the methods that can be used to correct for it. For example, there is a large 
literature on methods for measurement error correction when the measurement 
error is independent nondifferential. In general, methods for measurement er- 
ror correction rely on a combination of modeling assumptions and validation 
samples, i.e., subsets of the data in which key variables are measured with 
little or no error. The description of methods for measurement error correc- 
tion is beyond the scope of this book. Rather, our goal is to highlight that 
the act of measuring variables (like that of selecting individuals) may intro- 
duce bias (see Fine Point 9.1 for a discussion of its strength and direction). 
Realistic causal diagrams need to simultaneously represent biases arising from 
confounding, selection, and measurement. The best method to fight bias due 
to mismeasurement is, obviously, to improve the measurement procedures. 
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Fine Point 9.1 


The strength and direction of measurement bias. In general, measurement error will result in bias. A notable 
exception is the setting in which A and Y are unassociated and the measurement error is independent and nondifferential: 
If the arrow from A to Y did not exist in Figure 9.2, then both the A-Y association and the A*-Y* association would 
be null. In all other circumstances, measurement bias may result in an A*-Y* association that is either further from 
or closer to the null than the A-Y association. Worse, for non-dichotomous treatments, measurement bias may result 
in A*-Y* and A-Y associations in opposite directions. This association or trend reversal may occur even under the 
independent and nondifferential measurement error structure of Figure 9.2 when the mean of A* is a nonmonotonic 
function of A. See Dosemeci, Wacholder, and Lubin (1990) and Weinberg, Umbach, and Greenland (1994) for details. 
VanderWeele and Hernán (2009) described a more general framework using signed causal diagrams. 

The magnitude of the measurement bias depends on the magnitude of the measurement error. That is, measurement 
bias generally increases with the strength of the arrows from U_A to A* and from U_Y to Y*. Causal diagrams do not 
encode quantitative information, and therefore they cannot be used to describe the magnitude of the bias. 


9.3 Mismeasured confounders 


Besides the treatment A and the outcome Y, the confounders L may also be 
measured with error. Mismeasurement of confounders will result in bias even 
if both treatment and outcome are perfectly measured. To see this, consider 
the causal diagram in Figure 9.8, which includes the variables drug use A, liver 
disease Y, and history of hepatitis L. Individuals with prior hepatitis L are less 
likely to be prescribed drug A and more likely to develop liver disease Y. As 
discussed in Chapter 7, there is confounding for the effect of the treatment A on 
the outcome Y because there exists an open backdoor path A ← L → Y, but 
there is no unmeasured confounding given L because the backdoor path A ← 
L → Y can be blocked by conditioning on L. That is, there is exchangeability 
of the treated and the untreated conditional on the confounder L, and one can 
apply IP weighting or standardization to compute the average causal effect of 
A on Y. The standardized, or IP weighted, risk ratio based on L, Y, and A 
will equal the causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1]. 

Figure 9.8 

Again the implicit assumption in the above reasoning is that the confounder 
L was perfectly measured. Suppose investigators did not have access to the 
study participants’ medical records. Rather, to ascertain previous diagnoses of 
hepatitis, investigators had to ask participants via a questionnaire. Since not all 
participants provided an accurate recollection of their medical history (some 
did not want anyone to know about it, others had memory problems or simply 
made a mistake when responding to the questionnaire), the confounder L was 
measured with error. Investigators had data on the mismeasured variable L* 
rather than on the variable L. Unfortunately, the backdoor path A ← L → Y 
cannot be generally blocked by conditioning on L*. The standardized (or 
IP weighted) risk ratio based on L*, Y, and A will generally differ from the 
causal risk ratio Pr[Y^{a=1} = 1] / Pr[Y^{a=0} = 1]. We then say that there is 
measurement bias or information bias. 

The causal diagram in Figure 9.9 shows an example of confounding of the 
causal effect of A on Y in which L is not the common cause shared by A and 
Y. Here too mismeasurement of L leads to measurement bias because the 
backdoor path A ← L ← U → Y cannot be generally blocked by conditioning 
on L*. (Note that Figures 9.8 and 9.9 do not include the measurement error U_L 
because the particular structure of this error is not relevant to our discussion.) 


Figure 9.9 
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Figure 9.10 


Alternatively, one could view the bias due to mismeasured confounders in 
Figures 9.8 and 9.9 as a form of unmeasured confounding rather than as a form 
of measurement bias. In fact the causal diagram in Figure 9.8 is equivalent 
to that in Figure 7.6. One can think of L as an unmeasured variable and of 
L* as a surrogate confounder (see Fine Point 7.2). The particular choice of 
terminology—unmeasured confounding versus bias due to mismeasurement of 
the confounders—is irrelevant for practical purposes. 

Mismeasurement of confounders may also result in apparent effect modi- 
fication. As an example, suppose that all study participants who reported a 
prior diagnosis of hepatitis (L* = 1) and half of those who reported no prior 
diagnosis of hepatitis (L* = 0) did actually have a prior diagnosis of hepatitis 
(L = 1). That is, the true and the measured value of the confounder coincide 
in the stratum L* = 1, but not in the stratum L* = 0. Suppose further that 
treatment A has no effect on any individual’s liver disease Y, i.e., the sharp 
null hypothesis holds. When investigators restrict the analysis to the stratum 
L* = 1, there will be no confounding by L because all participants included in 
the analysis have the same value of L (i.e., L = 1). Therefore they will find no 
association between A and Y in the stratum L* = 1. However, when the inves- 
tigators restrict the analysis to the stratum L* = 0, there will be confounding 
by L because the stratum L* = 0 includes a mixture of individuals with both 
L = 1 and L = 0. Thus the investigators will find an association between A 
and Y as a consequence of uncontrolled confounding by L. If the investigators 
are unaware of the fact that there is mismeasurement of the confounder in the 
stratum L* = 0 but not in the stratum L* = 1, they could naively conclude 
that both the association measure in the stratum L* = 0 and the association 
measure in the stratum L* = 1 can be interpreted as effect measures. Because 
these two association measures are different, the investigators will say that L* 
is a modifier of the effect of A on Y even though no effect modification by the 
true confounder L exists. 
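
The apparent effect modification in this example can be reproduced with an exact calculation. The sketch encodes the stated scenario (everyone with L* = 1 has L = 1, and half of those with L* = 0 have L = 1) by taking Pr[L = 1] = 2/3 with a sensitivity of 0.5 and a specificity of 1 for L*. These values, the treatment and outcome probabilities, and the function names are hypothetical.

```python
P_L = 2 / 3            # Pr[L = 1], true prior hepatitis
SENS, SPEC = 0.5, 1.0  # Pr[L* = 1 | L = 1] and Pr[L* = 0 | L = 0]

def p_a1(l):
    return 0.2 if l == 1 else 0.6   # Pr[A = 1 | L = l]: hepatitis lowers prescription

def p_y1(l):
    return 0.4 if l == 1 else 0.1   # Pr[Y = 1 | L = l]: no effect of A (sharp null)

def p_lstar1(l):
    return SENS if l == 1 else 1 - SPEC   # Pr[L* = 1 | L = l]

def risk(a, lstar):
    """Pr[Y = 1 | A = a, L* = lstar], averaging over the true confounder L."""
    cases = total = 0.0
    for l in (0, 1):
        mass = P_L if l == 1 else 1 - P_L
        mass *= p_a1(l) if a == 1 else 1 - p_a1(l)
        mass *= p_lstar1(l) if lstar == 1 else 1 - p_lstar1(l)
        total += mass
        cases += mass * p_y1(l)
    return cases / total

def stratum_rr(lstar):
    return risk(1, lstar) / risk(0, lstar)

rr_lstar1 = stratum_rr(1)   # stratum in which L* and L coincide
rr_lstar0 = stratum_rr(0)   # stratum mixing individuals with L = 1 and L = 0
```

In the stratum L* = 1 the risk ratio is 1, as expected under the null. In the stratum L* = 0 it differs from 1 because of residual confounding by L, which an unaware investigator could misread as effect modification by L*.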

Finally, it is also possible that a collider C is measured with error as repre- 
sented in Figure 9.10. In this setting, conditioning on the mismeasured collider 
C* will generally introduce selection bias because C* is a common effect of the 
treatment A and the outcome Y. 


9.4 Intention-to-treat effect: the effect of a misclassified treatment 


Consider a marginally randomized experiment to compute the causal effect 
of heart transplant on 5-year mortality Y. So far in this book we have used 
the notation A = 1 to refer to the study participants who were assigned and 
therefore received treatment (heart transplant in this example), and A = 0 
to the others. This notation is appropriate for ideal randomized experiments 
in which all participants assigned to treatment actually received treatment, 
and in which all participants assigned to no treatment actually did not receive 
treatment. This notation, however, is not detailed enough for real randomized 
experiments in which participants may not comply with the assigned treatment. 

In real randomized experiments we need to distinguish between two treat- 
ment variables: the assigned treatment Z (1 if the person is assigned to trans- 
plant, 0 otherwise) and the received treatment A (1 if the person receives a 
transplant, 0 otherwise). For a given individual, the value of Z and A may 
differ because of lack of adherence to the assigned treatment. For example, 
an individual randomly assigned to receive a heart transplant (Z = 1) may 
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not receive it (A = 0) because he refuses to undergo the surgical procedure, 
or an individual assigned to medical therapy only (Z = 0) may still obtain a 
transplant (A = 1) outside of the study. In that sense, when individuals do not 
adhere to their assigned treatment, the assigned treatment Z is a misclassified 
version of the treatment A that was truly received by the study participants. 
Figure 9.11 represents a randomized experiment with Z, A, and Y (the variable 
U is discussed in the next section). 

But there is a key difference between the assigned treatment Z in random- 
ized experiments and the misclassified treatments A* that we have considered 
so far. The mismeasured treatment A* in Figures 9.1-9.7 does not have a 
causal effect on the outcome Y. The association between A* and Y is entirely 
due to their common cause A. Indeed, in observational studies, one generally 
expects no causal effect of the measured treatment A* on the outcome, even 
if the true treatment A has a causal effect. On the other hand, as shown in 
Figure 9.11, the assigned treatment Z in randomized experiments can have a 
causal effect on the outcome Y through two different pathways. 

First, treatment assignment Z may affect the outcome Y simply because 
it affects the received treatment A. Individuals assigned to heart transplant 
are more likely to receive a heart transplant, as represented by the arrow from 
Z to A. If receiving a heart transplant has a causal effect on mortality, as 
represented by the arrow from A to Y, then assignment to heart transplant 
has a causal effect on the outcome Y through the pathway Z → A → Y. 

Second, treatment assignment Z may affect the outcome Y through path- 
ways that are not mediated by received treatment A. For example, awareness 
of the assigned treatment might lead to changes in the behavior of study 
participants: patients who are aware of receiving a transplant may spontaneously 
change their diet in an attempt to keep their new heart healthy, and doctors may 
take special care of patients who were not assigned to a heart transplant. 
These behavioral changes are represented by the direct arrow from Z to Y. 

Hence, the causal effect of the assigned treatment Z is not equal to the effect 
of received treatment A because the magnitude of the effect of Z depends not 
only on the strength of the arrow A → Y (the effect of the received treatment), 
but also on the strength of the arrows Z → A (the degree of adherence to 
the assigned treatment in the study) and Z → Y (the concurrent behavioral 
changes). 
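
The dependence on the three arrows can be sketched numerically. The model below is a hypothetical illustration that is not taken from the text: it assumes that nobody assigned to Z = 0 receives treatment, that a fraction `adherence` of those assigned to Z = 1 receive it regardless of prognosis, and that the effects of A and of assignment itself add on the risk scale. The function name, the baseline risk, and all numbers are assumptions.

```python
def itt_risk_difference(adherence, effect_of_a, direct_effect_of_z):
    """Intention-to-treat risk difference under a simple additive model.

    Hypothetical assumptions, for illustration only:
      - nobody assigned to Z = 0 receives treatment;
      - a fraction `adherence` of those assigned to Z = 1 receive treatment,
        and adherence is unrelated to prognosis (arrow Z -> A);
      - receiving A changes the risk by `effect_of_a` (arrow A -> Y);
      - assignment itself changes the risk by `direct_effect_of_z` (arrow Z -> Y).
    """
    baseline_risk = 0.30  # assumed risk when Z = 0 and A = 0
    risk_z1 = baseline_risk + adherence * effect_of_a + direct_effect_of_z
    risk_z0 = baseline_risk
    return risk_z1 - risk_z0


# Same effect of A (-0.10) in the first three scenarios; only adherence
# and the direct arrow from Z to Y change.
full = itt_risk_difference(1.0, -0.10, 0.0)        # perfect adherence, blinded
half = itt_risk_difference(0.5, -0.10, 0.0)        # 50% adherence, blinded
unblinded = itt_risk_difference(0.5, -0.10, -0.03) # 50% adherence, direct arrow
null_a = itt_risk_difference(0.5, 0.0, -0.03)      # A has no effect, direct arrow

for name, value in [("full", full), ("half", half),
                    ("unblinded", unblinded), ("null_a", null_a)]:
    print(f"{name}: {value:+.2f}")
```

Under these assumptions the effect of Z equals the effect of A only with perfect adherence and no direct arrow. In the last scenario the effect of Z is non-null even though A has no effect, because the arrow from Z to Y remains.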

Often investigators try to partly “de-contaminate” the effect of Z by 
eliminating the arrow Z → Y as shown in Figure 9.12, which depicts the exclusion 
restriction of no direct arrow from Z to Y (see Technical Point 9.2). To do 
so, they withhold knowledge of the assigned treatment Z from participants 
and their doctors. For example, if Z were aspirin the investigators would ad- 
minister an aspirin pill to those randomly assigned to Z = 1, and a placebo 
(an identical pill except that it does not contain aspirin) to those assigned 
to Z = 0. Because participants and their doctors do not know whether the 
pill they are given is the active treatment or a placebo, they are said to be 
“blinded” and the study is referred to as a double-blind placebo-controlled ran- 
domized experiment. A double-blind treatment assignment, however, is often 
unfeasible. For example, in our heart transplant study, there is no practical 
way of administering a convincing placebo for open heart surgery. 

Again, a key point is that the effect of Z does not measure “the effect 
of treating with A” but rather “the effect of assigning participants to being 
treated with A” or “the effect of having the intention of treating with A,” 
which is why the causal effect of randomized assignment Z is referred to as the 
intention-to-treat effect. Yet, despite its dependence on adherence and other 
factors, the effect of treatment assignment Z is the effect that investigators 
pursue in most randomized experiments. Why would one be interested in the 
effect of assigned treatment Z rather than in the effect of the treatment truly 
received A? The next section provides some answers to this question. 


Technical Point 9.2 

The exclusion restriction. If the exclusion restriction holds, then there is no direct arrow from assigned treatment Z 
to the outcome Y, that is, all of the effect of Z on Y is mediated through the received treatment A. Let Y^{z,a} be 
the counterfactual outcome under randomized treatment assignment z and actual treatment received a. Formally, we 
say that the exclusion restriction holds when Y^{z=0,a} = Y^{z=1,a} for all individuals and all values a and, specifically, for 
the value A observed for each individual. Instrumental variable methods (see Chapter 16) rely critically on the exclusion 
restriction being true. 


9.5 Per-protocol effect 


In randomized experiments, the per-protocol effect is the causal effect of treat- 
ment that would have been observed if all individuals had adhered to their 
assigned treatment as specified in the protocol of the experiment. If all study 
participants happen to adhere to the assigned treatment, the values of assigned 
treatment Z and received treatment A coincide for all participants, and there- 
fore the per-protocol effect can be equivalently defined as either the average 
causal effect of Z or of A. As explained in Chapter 2, in ideal experiments 
with perfect adherence, the treated (A = 1) and the untreated (A = 0) are 
exchangeable, Y^a ⊥⊥ A, and association is causation. The associational risk ratio 
Pr[Y = 1|A = 1]/Pr[Y = 1|A = 0] is expected to equal the causal risk ratio 
Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1], which measures the per-protocol effect on the 
risk ratio scale. 

Consider now a setting in which some individuals do not adhere to the 
assigned treatment so that their values of assigned treatment Z and received 
treatment A differ. For example, suppose that the most severely ill individuals 
in the Z = 0 group tend to seek a heart transplant (A = 1) outside of the 
study. If that occurs, then the group A = 1 would include a higher proportion 
of severely ill individuals than the group A = 0: the groups A = 1 and A = 0 
would not be exchangeable, and thus association between A and Y would not 
be causation. The associational risk ratio Pr[Y = 1|A = 1]/Pr[Y = 1|A = 0] 
would not equal the causal per-protocol risk ratio Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1]. 

The setting described in the previous paragraph is represented by Figure 
9.11, with U representing severe illness (1: yes, 0: no). As indicated by the 
backdoor path A ← U → Y, there is confounding for the effect of A on 
Y. Because the reasons why participants receive treatment A include prog- 
nostic factors U, computing the per-protocol effect requires adjustment for 
confounding. That is, computation of the per-protocol effect requires viewing 
the randomized experiment as an observational study. If the factors U remain 
unmeasured, the effect of received treatment A cannot be correctly computed. 
See Fine Point 9.2 for a description of approaches to quantify the per-protocol 
effect when the prognostic factors that predict adherence are measured. 
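
A small exact calculation can make this concrete. The sketch below assumes a hypothetical joint distribution (none of the numbers come from the text): severe illness U has prevalence 0.3, everyone assigned to Z = 1 receives a transplant, half of the severely ill in the Z = 0 group obtain one outside the study, and treatment halves the risk within each level of U with no direct effect of Z. Adjustment is done here by standardization over U, which is one of several possible adjustment methods and requires U to be measured.

```python
from itertools import product

# Hypothetical inputs, chosen only for illustration.
P_U = {0: 0.7, 1: 0.3}   # Pr[U = u], severe illness
P_Z = {0: 0.5, 1: 0.5}   # Pr[Z = z], randomized assignment
# Pr[A = 1 | Z = z, U = u]: only severely ill controls seek treatment.
P_A1 = {(1, 0): 1.0, (1, 1): 1.0, (0, 0): 0.0, (0, 1): 0.5}
# Pr[Y = 1 | A = a, U = u]: treatment halves the risk in each stratum of U.
RISK = {(0, 0): 0.10, (1, 0): 0.05, (0, 1): 0.60, (1, 1): 0.30}


def joint():
    """Yield (z, u, a, y, probability) for every cell of the joint distribution."""
    for z, u, a, y in product((0, 1), repeat=4):
        p_a = P_A1[(z, u)] if a == 1 else 1 - P_A1[(z, u)]
        p_y = RISK[(a, u)] if y == 1 else 1 - RISK[(a, u)]
        yield z, u, a, y, P_Z[z] * P_U[u] * p_a * p_y


def risk(**given):
    """Pr[Y = 1 | given], where `given` fixes any of z, u, a."""
    num = den = 0.0
    for z, u, a, y, p in joint():
        cell = {"z": z, "u": u, "a": a}
        if all(cell[k] == v for k, v in given.items()):
            den += p
            num += p * y
    return num / den


# Unadjusted as-treated comparison: confounded by U.
as_treated_rr = risk(a=1) / risk(a=0)
# Intention-to-treat comparison: unconfounded, but answers a different question.
itt_rr = risk(z=1) / risk(z=0)
# Standardization over U recovers the per-protocol risk ratio.
standardized = {a: sum(risk(a=a, u=u) * P_U[u] for u in (0, 1)) for a in (0, 1)}
standardized_rr = standardized[1] / standardized[0]

print(f"as-treated RR   = {as_treated_rr:.3f}")
print(f"ITT RR          = {itt_rr:.3f}")
print(f"standardized RR = {standardized_rr:.3f}")
```

With these assumed inputs the per-protocol risk ratio is 0.5 by construction. The unadjusted as-treated ratio is closer to 1 because the treated group contains more severely ill individuals, and the standardized ratio matches 0.5 only because U is measured and used.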

In contrast, there is no confounding for the effect of assigned treatment 
Z. Because Z is randomly assigned, exchangeability Y^z ⊥⊥ Z holds for the 
assigned treatment Z even if it does not hold for the received treatment A. 
There are no backdoor paths from Z to Y in Figure 9.11. Association between 
Z and Y implies a causal effect of Z on Y, whether or not all individuals 
adhere to the assigned treatment. The associational risk ratio 
Pr[Y = 1|Z = 1]/Pr[Y = 1|Z = 0] equals the causal intention-to-treat risk ratio 
Pr[Y^{z=1} = 1]/Pr[Y^{z=0} = 1]. 

The analysis that estimates the unadjusted association between Z and 
Y to estimate the intention-to-treat effect is referred to as an 
intention-to-treat analysis. See Fine Point 9.4 for more on 
intention-to-treat analyses. 

In statistical terms, the intention-to-treat analysis provides a valid 
(though perhaps underpowered) α-level test of the null hypothesis of 
no average treatment effect. 


Fine Point 9.2 

Per-protocol analyses. In randomized trials, two common attempts to estimate the per-protocol effect of treatment 
A are ‘as treated’ and ‘per protocol’ analyses. 

A conventional as-treated analysis compares the distribution of the outcome Y in those who received treatment 
(A = 1) versus those who did not receive treatment (A = 0), regardless of their treatment assignment Z. Clearly, a 
conventional as-treated comparison will be confounded if the reasons that moved participants to take treatment were 
associated with prognostic factors U that were not measured, as in Figures 9.11 and 9.12. On the other hand, consider 
a setting in which all backdoor paths between A and Y can be blocked by conditioning on measured factors L, as in 
Figure 9.13. Then an as-treated analysis will succeed in estimating the per-protocol effect if it appropriately measures 
and adjusts for the factors L. 

A conventional per-protocol analysis (also referred to as an on-treatment analysis) only includes individuals who 
adhered to the study protocol: the so-called per-protocol population of participants with A = Z. The analysis then 
compares, in the per-protocol population only, the distribution of the outcome Y in those who were assigned to treatment 
(Z = 1) versus those who were not assigned to treatment (Z = 0). That is, a conventional per-protocol analysis, which 
is just an intention-to-treat analysis restricted to the per-protocol population, will generally result in a biased estimate 
of the per-protocol effect. To see why, consider the causal diagram in Figure 9.14, which includes an indicator of 
selection S into the per-protocol population: S = 1 if A = Z and S = 0 otherwise. Selection bias will arise unless the 
per-protocol analysis appropriately measures and adjusts for the factors L. 

That is, as-treated and per-protocol analyses are observational analyses of a randomized experiment and, like any 
observational analysis, require appropriate adjustment for confounding and selection bias to obtain valid estimates of 
the per-protocol effect. For examples and additional discussion, see Hernán and Hernández-Díaz (2012). 


The lack of confounding largely explains why the intention-to-treat effect is 
privileged in many randomized experiments: “the effect of having the intention 
of treating with A” may not measure the treatment effect that we want— “the 
effect of treating with A” or the per-protocol effect—but it is easier to compute 
correctly than the per-protocol effect. As often occurs when a less interesting 
quantity is easier to compute than a more interesting quantity, we tend to 
come up with arguments to justify the use of the less interesting quantity. 
The intention-to-treat effect is no exception. We now discuss why several well- 
known justifications for the intention-to-treat effect need to be taken with a 
grain of salt. See also Fine Point 9.4. 


A common justification for the intention-to-treat effect is that it preserves 
the null. That is, if treatment A has a null effect on Y, then assigned treatment 
Z will also have a null effect on Y. Null preservation is a key property because 
it ensures no effect will be declared when no effect exists. More formally, under 
the sharp causal null hypothesis and the exclusion restriction, it can be shown 
that Pr[Y = 1|Z = 1]/Pr[Y = 1|Z = 0] = Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1] = 1. 
However, this equality is not true when the exclusion restriction does not hold, 
as represented in Figure 9.11. In those cases—experiments that are not double- 
blind placebo-controlled—the effect of A may be null while the effect of Z is 
non-null. To see that, mentally erase the arrow A → Y in Figure 9.11: there 
is still an arrow from Z to Y. 


Fine Point 9.3 

Pseudo-intention-to-treat analysis. The intention-to-treat effect can only be directly computed from an 
intention-to-treat analysis if there are no losses to follow-up or other forms of censoring. When some individuals do not complete the 
follow-up, their outcomes are unknown and thus the analysis needs to be restricted to individuals with complete 
follow-up. Thus, we can only conduct a pseudo-intention-to-treat analysis Pr[Y = 1|Z = 1, C = 0]/Pr[Y = 1|Z = 0, C = 0] 
where C = 0 indicates that an individual remained uncensored until the measurement of Y. As described in Chapter 8, 
censoring may induce selection bias and thus the pseudo-intention-to-treat estimate may be a biased estimate, in either 
direction, of the intention-to-treat effect. In the presence of loss to follow-up or other forms of censoring, the analysis 
of randomized experiments requires appropriate adjustment for selection bias even to compute the intention-to-treat 
effect. For additional discussion, see Little et al. (2012). 



A related justification for the intention-to-treat effect is that its value is 
guaranteed to be closer to the null than the value of the per-protocol effect. 
The intuition is that imperfect adherence results in an attenuation, not an 
exaggeration, of the effect. Therefore, the intention-to-treat risk ratio 
Pr[Y = 1|Z = 1]/Pr[Y = 1|Z = 0] will have a value between 1 and that of the 
per-protocol risk ratio Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1]. The intention-to-treat effect 
can thus be interpreted as a lower bound for the per-protocol effect, i.e., as 
a conservative effect estimate. There are, however, three problems with this 
answer. 


Figure 9.13 





First, this justification assumes monotonicity of effects (see Technical Point 
5.2), that is, that the treatment effect is in the same direction for all individuals. 
If this were not the case and the degree of non-adherence were high, then the 
per-protocol effect may be closer to the null than the intention-to-treat effect. 
For example, suppose that 50% of the individuals assigned to treatment did 
not adhere (e.g., because of mild adverse effects after taking a couple of pills), 
and that the direction of the effect is opposite in those who did and did not 
adhere. Then the intention-to-treat effect would be anti-conservative. 

Figure 9.14 

Second, suppose the effects are monotonic. The intention-to-treat effect 
may be conservative in placebo-controlled experiments, but not necessarily in 
head-to-head trials in which individuals are assigned to two active treatments. 
Suppose individuals with a chronic and painful disease were randomly assigned 
to either an expensive drug (Z = 1) or ibuprofen (Z = 0). The goal was to 
determine which drug results in a lower risk of severe pain Y after 1 year of 
follow-up. Unknown to the investigators, both drugs are equally effective at reducing 
pain, that is, the per-protocol (causal) risk ratio Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1] 
is 1. However, adherence to ibuprofen happened to be lower than adherence 
to the expensive drug because of a mild, easily palliated side effect. As a 
result, the risk of severe pain was higher in the group assigned to ibuprofen, the 
intention-to-treat risk ratio Pr[Y = 1|Z = 1]/Pr[Y = 1|Z = 0] was 
less than 1, and the investigators wrongly concluded that ibuprofen was 
less effective than the expensive drug at reducing severe pain. 
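
The head-to-head example can be worked through with made-up numbers. The sketch below assumes, purely for illustration, that either drug lowers the 1-year risk of severe pain from 0.50 to 0.20 when taken, that non-adherers take neither drug, that adherence is unrelated to prognosis, and that adherence is 90% for the expensive drug and 60% for ibuprofen. None of these values appear in the text.

```python
def arm_risk(adherence, risk_if_treated, risk_if_untreated):
    """Risk of severe pain in an arm where a fraction `adherence` takes the drug.

    Assumes (hypothetically) that non-adherers take no drug at all and that
    adherence is unrelated to prognosis.
    """
    return adherence * risk_if_treated + (1 - adherence) * risk_if_untreated


RISK_TREATED = 0.20    # assumed; identical for both drugs, so the per-protocol RR is 1
RISK_UNTREATED = 0.50  # assumed risk when the assigned drug is not taken

risk_expensive = arm_risk(0.90, RISK_TREATED, RISK_UNTREATED)  # Z = 1
risk_ibuprofen = arm_risk(0.60, RISK_TREATED, RISK_UNTREATED)  # Z = 0

per_protocol_rr = RISK_TREATED / RISK_TREATED
itt_rr = risk_expensive / risk_ibuprofen

print(f"Pr[Y = 1 | Z = 1] = {risk_expensive:.2f}")
print(f"Pr[Y = 1 | Z = 0] = {risk_ibuprofen:.2f}")
print(f"per-protocol RR = {per_protocol_rr:.2f}, intention-to-treat RR = {itt_rr:.3f}")
```

With these assumed values the two drugs are equally effective, yet the arm with poorer adherence has the higher risk (0.32 versus 0.23), so the intention-to-treat risk ratio moves away from 1 and is not conservative with respect to a per-protocol risk ratio of exactly 1.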
Third, suppose the intention-to-treat effect is indeed conservative. Then 
the intention-to-treat effect is a dangerous effect measure when the goal is 
evaluating a treatment’s safety: one could naively conclude that a treatment A 
is safe because the intention-to-treat effect of Z on the adverse outcome is close 
to null, even if treatment A causes the adverse outcome in a significant fraction 
of the patients. The explanation may be that many individuals assigned to 
Z = 1 did not take, or stopped taking, the treatment before developing the 
adverse outcome. 

A similar argument against conservative intention-to-treat analyses 
applies to non-inferiority trials, in which the goal is to demonstrate 
that one treatment is not inferior to the other. 


Fine Point 9.4 

Effectiveness versus efficacy. Some authors refer to the per-protocol effect, e.g., Pr[Y^{a=1} = 1]/Pr[Y^{a=0} = 1], as 
the treatment’s “efficacy,” and to the intention-to-treat effect, e.g., Pr[Y^{z=1} = 1]/Pr[Y^{z=0} = 1], as the treatment’s 
“effectiveness.” A treatment’s “efficacy” closely corresponds to what we have referred to as the average causal effect of 
treatment A in an ideal randomized experiment. In contrast, a treatment’s “effectiveness” would correspond to the effect 
of assigning treatment Z in a setting in which the interventions under study will not be optimally implemented, typically 
because a fraction of study individuals will not adhere. Using this terminology, it is often argued that “effectiveness” is 
the most realistic measure of a treatment’s effect because “effectiveness” includes any effects of treatment assignment Z 
not mediated through the received treatment A, and already incorporates the fact that people will not perfectly adhere 
to the assigned treatment. A treatment’s “efficacy,” on the other hand, does not reflect a treatment’s effect in real 
conditions. Thus it is claimed that one is justified to report the intention-to-treat effect as the primary finding from a 
randomized experiment not only because it is easy to compute, but also because “effectiveness” is the truly interesting 
effect measure. 

Unfortunately, the above argumentation is problematic. First, the intention-to-treat effect measures the effect of 
assigned treatment under the adherence conditions observed in a particular experiment. The actual adherence in real 
life may be different (e.g., participants in a study may adhere better if they are closely monitored), and may actually 
be affected by the findings from that particular experiment (e.g., people will be more likely to adhere to a treatment 
after they learn it works). Second, the above argumentation implies that we should refrain from conducting double-blind 
randomized clinical trials because, in real life, both patients and doctors are aware of the received treatment. Thus a true 
“effectiveness” measure should incorporate the effects stemming from assignment awareness (e.g., behavioral changes) 
that are eliminated in double-blind randomized experiments. Third, individual patients who are planning to adhere to 
the treatment prescribed by their doctors will be more interested in the per-protocol effect than in the intention-to-treat 
effect. For more details, see the discussion by Hernán and Hernández-Díaz (2012). 


Thus the exclusive reporting of intention-to-treat effect estimates as the 
findings from a randomized experiment is hard to justify for experiments with 
substantial non-adherence, and for those aiming at estimating harms rather 
than benefits. Unfortunately, computing the per-protocol effect requires 
adjustment for confounding under the assumption of exchangeability conditional 
on the measured covariates, or via instrumental variable estimation (a particular 
case of g-estimation, see Chapter 16) under alternative assumptions. 

For a non-technical discussion of per-protocol effects in complex randomized 
experiments, see Hernán and Robins (2017). 

Our discussion of per-protocol effects has been necessarily oversimplified because 
we have not yet introduced time-varying treatments in this book. When, as 
often happens, treatment can vary over time in a randomized experiment, 
we define the per-protocol effect as the effect that would have been observed 
if everyone had adhered to their assigned treatment strategy throughout the 
follow-up. Part III describes the concepts and methods that are required to 
define and estimate per-protocol effects in the general case. 

In summary, in the analysis of randomized experiments there is a trade-off 
between bias due to potential unmeasured confounding—when choosing the 
per-protocol effect—and misclassification bias—when choosing the intention- 
to-treat effect. Reporting only the intention-to-treat effect implies preference 
for misclassification bias over confounding, a preference that needs to be jus- 
tified in each application. 


Chapter 10 
RANDOM VARIABILITY 


Suppose an investigator conducted a randomized experiment to answer the causal question “does one’s looking 
up to the sky make other pedestrians look up too?” She found an association between her looking up and other 
pedestrians’ looking up. Does this association reflect a causal effect? By definition of randomized experiment, 
confounding bias is not expected in this study. In addition, no selection bias was expected because all pedestrians’ 
responses—whether they did or did not look up—were recorded, and no measurement bias was expected because 
all variables were perfectly measured. However, there was another problem: the study included only 4 pedestrians, 
2 in each treatment group. By chance, 1 of the 2 pedestrians in the “looking up” group, and neither of the 2 
pedestrians in the “looking straight” group, was blind. Thus, even if the treatment (the investigator’s looking 
up) truly had a strong average effect on the outcome (other people’s looking up), half of the individuals in the 
treatment group happened to be immune to the treatment. The small size of the study population led to a dilution 
of the estimated effect of treatment on the outcome. 

There are two qualitatively different reasons why causal inferences may be wrong: systematic bias and ran- 
dom variability. The previous three chapters described three types of systematic biases: selection bias, measure- 
ment bias—both of which may arise in observational studies and in randomized experiments—and unmeasured 
confounding—which is not expected in randomized experiments. So far we have disregarded the possibility of 
bias due to random variability by restricting our discussion to huge study populations. In other words, we have 
operated as if the only obstacles to identify the causal effect were confounding, selection, and measurement. It is 
about time to get real: the size of study populations in etiologic research rarely precludes the possibility of bias 
due to random variability. This chapter discusses random variability and how we deal with it. 


10.1 Identification versus estimation 


The first nine chapters of this book are concerned with the computation of 
causal effects in study populations of near infinite size. For example, when 
computing the causal effect of heart transplant on mortality in Chapter 2, we 
only had a twenty-person study population but we regarded each individual 
in our study as representing 1 billion identical individuals. By acting as if 
we could obtain an unlimited number of individuals for our studies, we could 
ignore random fluctuations and could focus our attention on systematic biases 
due to confounding, selection, and measurement. Statisticians have a name for 
problems in which we can assume the size of the study population is effectively 
infinite: identification problems. 

Thus far we have reduced causal inference to an identification problem. Our 
only goal has been to identify (or, as we often said, to compute) the average 
causal effect of treatment A on the outcome Y. The concept of identifiability 
was first described in Section 3.1—and later discussed in Sections 7.2 and 
8.4—where we also introduced some conditions generally required to identify 
causal effects even if the size of the study population could be made arbitrarily 
large. These so-called identifying conditions were exchangeability, positivity, 
and consistency. 

Our ignoring random variability may have been pedagogically convenient 
to introduce systematic biases, but it was also extremely unrealistic. In real research 
projects, the study population is not effectively infinite and hence we cannot 
ignore the possibility of random variability. To this end let us return to our 
twenty-person study of heart transplant and mortality in which 7 of the 13 
treated individuals died. 

For an introduction to statistics, see the book by Wasserman (2004). 
For a more detailed introduction, see Casella and Berger (2002). 

Suppose our study population of 20 can be conceptualized as being a ran- 
dom sample from a super-population so large compared with the study popu- 
lation that we can effectively regard it as infinite. Further, suppose our goal is 
to make inferences about the super-population. For example, we may want 
to make inferences about the super-population probability (or proportion) 
Pr[Y = 1|A = a]. We refer to the parameter of interest in the super-population, 
the probability Pr[Y = 1|A = a] in this case, as the estimand. An estimator 
is a rule that takes the data from any sample from the super-population and 
produces a numerical value for the estimand. This numerical value for a 
particular sample is the estimate from that sample. The sample proportion of 
individuals that develop the outcome among those receiving treatment level 
a, \widehat{Pr}[Y = 1|A = a], is an estimator of the super-population probability 
Pr[Y = 1|A = a]. The estimate from our sample is \widehat{Pr}[Y = 1|A = a] = 7/13. 
More specifically, we say that 7/13 is a point estimate. The value of the esti- 
mate will depend on the particular 20 individuals randomly sampled from the 
super-population. 

As informally defined in Chapter 1, an estimator is consistent for a par- 
ticular estimand if the estimates get (arbitrarily) closer to the parameter as 
the sample size increases (see Technical Point 10.1 for the formal definition). 
Thus the sample proportion \widehat{Pr}[Y = 1|A = a] consistently estimates the 
super-population probability Pr[Y = 1|A = a], i.e., the larger the number 
n of individuals in our study population, the smaller the magnitude of 
Pr[Y = 1|A = a] − \widehat{Pr}[Y = 1|A = a] is expected to be. Previous chapters 
were exclusively concerned with identification; from now on we will be 
concerned with statistical estimation. 
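
The idea that the sample proportion tends to get closer to the super-population probability as n grows can be illustrated by simulation. The sketch below treats 7/13 as if it were the true super-population probability, which is an assumption made only for the sake of the example, and draws samples of increasing size with a fixed seed. The function name and the sample sizes are likewise illustrative.

```python
import random


def sample_proportion(n, p, rng):
    """Proportion of n independent Bernoulli(p) draws that equal 1."""
    return sum(rng.random() < p for _ in range(n)) / n


rng = random.Random(2024)  # fixed seed so the sketch is reproducible
p = 7 / 13                 # assumed super-population probability (illustrative)

# Absolute error of the sample proportion at three sample sizes.
errors = {n: abs(sample_proportion(n, p, rng) - p) for n in (13, 1_300, 130_000)}

for n, err in errors.items():
    print(f"n = {n:>7}: |estimate - p| = {err:.4f}")
```

A single run like this cannot prove consistency, and the error need not shrink at every step, but with n = 130,000 the standard error is roughly 0.0014, so the estimate is very unlikely to be far from p.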

Even consistent estimators may result in point estimates that are far from 
the super-population value. Large differences between the point estimate and 
the super-population value of a proportion are much more likely to happen 
when the size of the study population is small compared with that of the super- 
population. Therefore it makes sense to have more confidence in estimates 
that originate from larger study populations. In the absence of systematic 
biases, statistical theory allows one to quantify this confidence in the form of a 
confidence interval around the point estimate. The larger the size of the study 
population, the narrower the confidence interval. A common way to construct 
a 95% confidence interval for a point estimate is to use a 95% Wald confidence 
interval centered at a point estimate. It is computed as follows. 

First, estimate the standard error of the point estimate under the assump- 
tion that our study population is a random sample from a much larger super- 
population. Second, calculate the upper limit of the 95% Wald confidence 
interval by adding 1.96 times the estimated standard error to the point esti- 
mate, and the lower limit of the 95% confidence interval by subtracting 1.96 
times the estimated standard error from the point estimate. For example, 
consider our estimator \widehat{Pr}[Y = 1|A = a] = \hat{p} of the super-population parameter 
Pr[Y = 1|A = a] = p. Its standard error is \sqrt{p(1 - p)/n} (the standard error of a 
binomial) and thus its estimated standard error is 
\sqrt{\hat{p}(1 - \hat{p})/n} = \sqrt{(7/13)(6/13)/13} = 0.138. 
Recall that the Wald 95% confidence interval for a parameter \theta based 
on an estimator \hat{\theta} is \hat{\theta} ± 1.96 × \widehat{se}(\hat{\theta}), where \widehat{se}(\hat{\theta}) is an estimate of the (exact 
or large sample) standard error of \hat{\theta} and 1.96 is the upper 97.5% quantile of 
a standard normal distribution with mean 0 and variance 1. Therefore the 
95% Wald confidence interval for our estimate is 0.27 to 0.81. The length and 
centering of the 95% Wald confidence interval will vary from sample to sample. 

A Wald confidence interval centered at \hat{p} is only guaranteed to be 
valid in large samples. For simplicity, here we assume that our sample 
size is sufficiently large for the validity of our Wald interval. 

In contrast with a frequentist 95% confidence interval, a Bayesian 95% 
credible interval can be interpreted as “there is a 95% probability that 
the estimand is in the interval”. However, for a Bayesian, probability 
is defined not as a frequency over hypothetical repetitions but as 
degree-of-belief. In this book we adopt the frequency definition of 
probability. See Fine Point 11.2 for more on Bayesian intervals. 

There are many valid large-sample confidence intervals other than the 
Wald interval (Casella and Berger, 2002). One of these might be preferred 
over the Wald interval, which can be badly anti-conservative in 
small samples (Brown et al., 2001). 
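
The arithmetic behind this interval can be checked with a few lines of code. The sketch below uses only the numbers given in the text (7 deaths among 13 treated individuals and the multiplier 1.96); the function name `wald_interval` is an illustrative choice.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Point estimate, estimated standard error, and Wald limits for a proportion."""
    p_hat = successes / n
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated binomial standard error
    return p_hat, se_hat, p_hat - z * se_hat, p_hat + z * se_hat


p_hat, se_hat, lower, upper = wald_interval(7, 13)
print(f"estimate = {p_hat:.3f}, se = {se_hat:.3f}, "
      f"95% Wald CI = ({lower:.2f}, {upper:.2f})")
```

Rounded to the precision used in the text, this gives a standard error of 0.138 and limits of 0.27 and 0.81.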

A 95% confidence interval is calibrated if the estimand is contained in the 
interval in 95% of random samples, conservative if the estimand is contained in 
more than 95% of samples, and anticonservative otherwise. We will say that a 
confidence interval is valid if, for any value of the true parameter, the interval 
is either calibrated or conservative, i.e. it covers the true parameter at least 
95% of the time. We would like to choose the valid interval whose width is 
narrowest. 
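
Whether an interval procedure is calibrated, conservative, or anticonservative at a given sample size can be checked by computing its coverage. The sketch below does this exactly, by enumerating all binomial outcomes, for the Wald interval of a proportion. The true value p = 0.5 and the sample sizes 13 and 200 are assumptions chosen only for illustration.

```python
import math


def wald_coverage(n, p, z=1.96):
    """Exact probability that the Wald interval covers p when K ~ Binomial(n, p)."""
    coverage = 0.0
    for k in range(n + 1):
        p_hat = k / n
        half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half_width <= p <= p_hat + half_width:
            # Add the binomial probability of observing exactly k events.
            coverage += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return coverage


small = wald_coverage(13, 0.5)    # sample size of the treated group in the example
larger = wald_coverage(200, 0.5)  # a larger, hypothetical sample size

print(f"coverage at n = 13:  {small:.3f}")
print(f"coverage at n = 200: {larger:.3f}")
```

Under these assumptions the nominal 95% Wald interval covers p = 0.5 in only about 91% of samples of size 13, so it is anticonservative there, while at n = 200 the coverage is much closer to 95%.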

The validity of confidence intervals is defined in terms of the frequency of 
coverage in repeated samples from the super-population, but we only see one 
of those samples when we conduct a study. Why should we care about what 
would have happened in other samples that we did not see? One important 
answer is that the definition of confidence interval also implies the following. 
Suppose we and all of our colleagues keep conducting research studies for the 
rest of our lifetimes. In each new study, we construct a valid 95% confidence 
interval for the parameter of interest. Then, at the end of our lives, we can look 
back at all the studies that were conducted, and conclude that the parameters 
of interest were trapped in—or covered by—the confidence interval in at least 
95% of the studies. Unfortunately, we will have no way of identifying the (up 
to) 5% of the studies in which the confidence interval failed to include the 
super-population quantity. 

Importantly, the 95% confidence interval from a single study does not im- 
ply that there is a 95% probability that the estimand is in the interval. In 
our example, we cannot conclude that the probability that the estimand lies 
between the values 0.27 and 0.81 is 95%. The estimand is fixed, which implies 
that either it is or it is not included in the particular interval (0.27, 0.81). 
In this sense, the probability that the estimand is included in that interval is 
either 0 or 1. A confidence interval only has a frequentist interpretation. Its 
level (e.g., 95%) refers to the frequency with which the interval will trap the 
unknown super-population quantity of interest over a collection of studies (or 
in hypothetical repetitions of a particular study). 

Confidence intervals are often classified as either small-sample or large- 
sample confidence intervals. A small-sample valid (conservative or calibrated) 
confidence interval is one that is valid at all sample sizes for which it is de- 
fined. Small-sample calibrated confidence intervals are sometimes called ex- 
act confidence intervals. A large-sample (equivalently, asymptotic) valid con- 
fidence interval is one that is valid only in large samples. A large-sample 
calibrated 95% confidence interval is one whose coverage becomes arbitrarily 
close to 95% as the sample size increases. The Wald confidence interval for 
Pr[Y = 1|A = a] = p mentioned above is a large-sample calibrated confidence 
interval, but not a small-sample valid interval. (There do exist small-sample 
valid confidence intervals for p, but they are not often used in practice.) When 
the sample size is small, a valid large-sample confidence interval, such as the 
Wald 95% confidence interval of our example above, may not be valid. In this 
book, when we use the term 95% confidence interval, we mean a large-sample 
valid confidence interval, like a Wald interval, unless stated otherwise. See also 
Fine Point 10.1. 
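
The gap between large-sample and small-sample validity can be checked numerically. The following is a minimal sketch, not a procedure taken from the text: it assumes a true risk p = 0.5 (a hypothetical value) and computes the exact coverage of the 95% Wald interval by summing binomial probabilities over all possible outcomes. The function names are illustrative.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k events in n trials (requires 0 < p < 1)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def wald_interval(k, n, z=1.96):
    """Wald interval centered on the sample proportion k/n."""
    p_hat = k / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def wald_coverage(n, p, z=1.96):
    """Exact probability that the Wald interval contains p when the count is Binomial(n, p)."""
    coverage = 0.0
    for k in range(n + 1):
        lower, upper = wald_interval(k, n, z)
        if lower <= p <= upper:
            coverage += binom_pmf(k, n, p)
    return coverage

# Interval of the running example: 7 events among 13 individuals
example_interval = wald_interval(7, 13)

# Hypothetical true risk of 0.5, at a small and at a large sample size
coverage_small = wald_coverage(13, 0.5)
coverage_large = wald_coverage(1000, 0.5)

print(example_interval, coverage_small, coverage_large)
```

Under these assumptions the coverage at n = 13 is about 0.91, visibly below the nominal 95%, while at n = 1000 it is close to 95%. Coverage at other values of p will differ.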

However, not all consistent estimators can be used to center a valid Wald 
confidence interval, even in large samples. Most users of statistics will consider 
an estimator unbiased if it can center a valid Wald interval and biased if it 
Fine Point 10.1 


Honest confidence intervals. The smallest sample size at which a large-sample, valid 95% confidence interval covers 
the true parameter at least 95% of the time may depend on the unknown value of the true parameter. We say a 
large-sample valid 95% confidence interval is uniform or honest if there exists a sample size n at which the interval is 
guaranteed to cover the true parameter value at least 95% of the time, whatever be the value of the true parameter. We 
demand honest intervals because, in the absence of uniformity, at any finite sample size there may be data generating 
distributions under which the coverage of the true parameter is much less than 95%. Unfortunately, for a large-sample, 
honest confidence interval, the smallest such n is generally unknown and is difficult to determine even by simulation. 
See Robins and Ritov (1997) for technical details. 

In the remainder of the text, when we refer to valid confidence intervals, we will mean large-sample honest confidence 
intervals. By definition, any small-sample valid confidence interval is uniform or honest for all n for which the interval 
is defined. 


cannot (see Technical Point 10.1 for details). For now, we will equate the term 
bias with the inability to center valid Wald confidence intervals. Also, bear in 
mind that confidence intervals only quantify uncertainty due to random error, 
and thus the confidence we put on confidence intervals may be excessive in the 
presence of systematic biases (see Fine Point 10.2 for details). 


10.2 Estimation of causal effects 


Suppose our heart transplant study was a marginally randomized experiment, 
and that the 20 individuals were a random sample of all individuals in a nearly 
infinite super-population of interest. Suppose further that all individuals in 
the super-population were randomly assigned to either A = 1 or A = 0, and 
that all of them adhered to their assigned treatment. Exchangeability of the 
treated and the untreated would hold in the super-population, i.e., 
$\Pr[Y^a = 1] = \Pr[Y = 1 \mid A = a]$, and therefore the causal risk difference 
$\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]$ equals the associational risk difference 
$\Pr[Y = 1 \mid A = 1] - \Pr[Y = 1 \mid A = 0]$ in the super-population. 

Because our study population is a random sample of the super-population, 
the sample proportion of individuals that develop the outcome among those 
with observed treatment value A = a, $\widehat{\Pr}[Y = 1 \mid A = a]$, is an unbiased 
estimator of the super-population probability $\Pr[Y = 1 \mid A = a]$. Because of 
exchangeability in the super-population, the sample proportion 
$\widehat{\Pr}[Y = 1 \mid A = a]$ is also an unbiased estimator of $\Pr[Y^a = 1]$. Thus testing the 
causal null hypothesis $\Pr[Y^{a=1} = 1] = \Pr[Y^{a=0} = 1]$ boils down to comparing, via 
standard statistical procedures, the sample proportions $\widehat{\Pr}[Y = 1 \mid A = 1] = 7/13$ 
and $\widehat{\Pr}[Y = 1 \mid A = 0] = 3/7$. Standard statistical methods can also 
be used to compute 95% confidence intervals for the causal risk difference and 
causal risk ratio in the super-population, which are estimated by (7/13) — (3/7) 
and (7/13) /(3/7), respectively. Slightly more involved, but standard, statistical 
procedures are used in observational studies to obtain confidence intervals for 
standardized, IP weighted, or stratified association measures. 
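
As an illustration of such standard procedures, the sketch below computes Wald 95% confidence intervals for the risk difference and the risk ratio from the counts 7/13 and 3/7. The particular formulas (a binomial variance for the risk difference, and a Wald interval on the log scale for the risk ratio) are common textbook choices assumed here, not necessarily the ones the authors have in mind, and the function names are illustrative. With only 20 individuals, these large-sample intervals may not be valid.

```python
import math

def risk_difference_ci(y1, n1, y0, n0, z=1.96):
    """Risk difference and its Wald interval from event counts y and group sizes n."""
    p1, p0 = y1 / n1, y0 / n0
    rd = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return rd, (rd - z * se, rd + z * se)

def risk_ratio_ci(y1, n1, y0, n0, z=1.96):
    """Risk ratio and its Wald interval computed on the log scale."""
    rr = (y1 / n1) / (y0 / n0)
    se_log = math.sqrt(1 / y1 - 1 / n1 + 1 / y0 - 1 / n0)
    return rr, (rr * math.exp(-z * se_log), rr * math.exp(z * se_log))

# 7 of 13 treated and 3 of 7 untreated individuals developed the outcome
rd, rd_ci = risk_difference_ci(7, 13, 3, 7)
rr, rr_ci = risk_ratio_ci(7, 13, 3, 7)

print(f"RD = {rd:.3f}, 95% CI ({rd_ci[0]:.3f}, {rd_ci[1]:.3f})")
print(f"RR = {rr:.3f}, 95% CI ({rr_ci[0]:.3f}, {rr_ci[1]:.3f})")
```

Both intervals are wide and include the null value (0 for the difference, 1 for the ratio), as expected with so few individuals.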

There is an alternative way to think about sampling variability in random- 
ized experiments. Suppose only individuals in the study population, not all 
individuals in the super-population, are randomly assigned to either A = 1 
Technical Point 10.1 


Bias and consistency in statistical inference. We have discussed systematic bias (due to unknown sources of 
confounding, selection, or measurement error) and consistent estimators in earlier chapters. Here we discuss these and 
other concepts of bias, and describe how they are related. 

To provide a formal definition of consistent estimator for an estimand $\theta$, suppose we observe n independent, 
identically distributed (i.i.d.) copies of a vector-valued random variable whose distribution P lies in a set M of distributions 
(our model). Then the estimator $\hat\theta_n$ is consistent for $\theta = \theta(P)$ in model M if $\hat\theta_n$ converges to $\theta$ in probability for every 
$P \in M$, i.e., 

$$\Pr_P\left[\, |\hat\theta_n - \theta(P)| > \varepsilon \,\right] \to 0 \text{ as } n \to \infty \text{ for every } \varepsilon > 0,\ P \in M.$$ 

The estimator $\hat\theta_n$ is exactly unbiased in model M if, for every $P \in M$, $E_P[\hat\theta_n] = \theta(P)$. The exact bias under P is 
the difference $E_P[\hat\theta_n] - \theta(P)$. We denote the estimator by $\hat\theta_n$ rather than by simply $\hat\theta$ to emphasize that the estimate 
depends on the sample size n. On the other hand, the parameter $\theta(P)$ is a fixed, though unknown, quantity depending 
on $P \in M$. When P is the distribution generating the data in our study, we often suppress the P in the notation and, 
e.g., write $E[\hat\theta_n] = \theta$. For many parameters $\theta$, such as the risk ratio Pr[Y = 1|A = 1]/Pr[Y = 1|A = 0], exactly 
unbiased estimators do not exist. 

A systematically biased estimator is neither consistent nor exactly unbiased. Robins and Morgenstern (1987) 
argue that most applied researchers (e.g., epidemiologists) will declare an estimator unbiased only if it can center a 
valid Wald confidence interval. They show that under this definition, an estimator is only unbiased if it is uniformly 
asymptotic normal and unbiased (UANU), as only UANU estimators can center valid standard Wald intervals for $\theta(P)$ 
under the model M. An estimator $\hat\theta_n$ is UANU in model M if there exist sequences $\sigma_n(P)$ such that the z-statistic 
$n^{1/2}(\hat\theta_n - \theta(P))/\sigma_n(P)$ converges uniformly to a standard normal random variable in the following sense: for $t \in \mathbb{R}$, 

$$\sup_{P \in M} \left| \Pr_P\left[ n^{1/2} (\hat\theta_n - \theta(P)) / \sigma_n(P) < t \right] - \Phi(t) \right| \to 0 \text{ as } n \to \infty$$ 

where $\Phi(t)$ is the standard normal cumulative distribution function (Robins and Ritov, 1997). 

All inconsistent estimators and some consistent estimators (see Chapter 18 for examples), are biased under this 
definition. In the text, whenever we say an estimator is unbiased (without further qualification) we mean that it is 
UANU. 





or A = 0. Because of the presence of random sampling variability, we do 
not expect that exchangeability will exactly hold in our sample. For example, 
suppose that only the 20 individuals in our study were randomly assigned to 
either heart transplant (A = 1) or medical treatment (A = 0). Suppose further 
that each individual can be classified as good or bad prognosis at the time of 
randomization. We say that the groups A = 0 and A = 1 are exchangeable 
if they include exactly the same proportion of individuals with bad prognosis. 
By chance, it is possible that 2 out of the 13 individuals assigned to A = 1 
and 3 of the 7 individuals assigned to A = 0 had bad prognosis. However, if 
we increased the size of our sample then there is a high probability that the 
relative imbalance between the groups A = 1 and A = 0 would decrease. 
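
The shrinking of chance imbalances with sample size can be seen in a small simulation. This is a sketch under assumed values: each individual has bad prognosis with a hypothetical probability of 0.25 and is assigned to A = 1 by a fair coin, and the function name is illustrative.

```python
import random

def mean_abs_imbalance(n, reps, rng, pr_bad=0.25):
    """Average absolute difference between arms in the proportion with bad prognosis."""
    total, used = 0.0, 0
    for _ in range(reps):
        bad = [0, 0]    # bad[a]: individuals with bad prognosis in arm a
        size = [0, 0]   # size[a]: individuals assigned to arm a
        for _ in range(n):
            a = 1 if rng.random() < 0.5 else 0   # marginal randomization
            size[a] += 1
            if rng.random() < pr_bad:            # prognosis is independent of A
                bad[a] += 1
        if 0 in size:
            continue                             # skip runs with an empty arm
        total += abs(bad[1] / size[1] - bad[0] / size[0])
        used += 1
    return total / used

rng = random.Random(10)
imbalance_small = mean_abs_imbalance(20, 500, rng)     # trials of 20 individuals
imbalance_large = mean_abs_imbalance(2000, 500, rng)   # trials of 2000 individuals

print(imbalance_small, imbalance_large)
```

In this setup the average imbalance in trials of 2000 individuals is a small fraction of that in trials of 20 individuals.
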
Under this conceptualization, there are two possible targets for inference. 
First, investigators may be agnostic about the existence of a super-population 
and restrict their inference to the sample that was actually randomized. This is 
referred to as randomization-based inference, and requires taking into account 
some technicalities that are beyond the scope of this book (see Robins (1988) 
for a discussion of randomization-based inference). Second, investigators 
may still be interested in making inferences about the super-population 
Fine Point 10.2 


Uncertainty from systematic biases. The width of the usual Wald-type confidence intervals is a function of the 
standard error of the estimator and thus reflects only uncertainty due to random error. However, the possible presence 
of systematic bias due to confounding, selection, or measurement is another important source of uncertainty. The larger 
the study population, the smaller the random error is both absolutely and as a proportion of total uncertainty, and hence 
the more the usual Wald confidence interval will understate the true uncertainty. 

The stated 95% confidence in a 95% confidence interval becomes overconfidence as population size increases 
because the interval excludes uncertainty due to systematic biases, which are not diminished by increasing the sample 
size. As a consequence, some authors advocate referring to such intervals by a less confident name, calling them 
compatibility intervals instead. The renaming recognizes that such intervals can only show us which effect sizes are 
highly compatible with the data under our adjustment assumptions and methods (Amrhein et al. 2019; Greenland 2019). 
The compatibility concept is weaker than the confidence concept, for it does not demand complete confidence that our 
adjustment removes all systematic biases. 

Regardless of the name of the intervals, the uncertainty due to systematic bias is usually a central part of the 
discussion section of scientific articles. However, most discussions revolve around informal judgments about the potential 
direction and magnitude of the systematic bias. Some authors argue that quantitative methods need to be used 
to produce intervals around the effect estimate that integrate random and systematic sources of uncertainty. These 
methods are referred to as quantitative bias analysis. See the book by Lash, Fox, and Fink (2009). Bayesian alternatives 
are discussed by Greenland and Lash (2008), and Greenland (2009a, 2009b). 





from which the study sample was randomly drawn. From an inference stand- 
point, this latter case turns out to be mathematically equivalent to the con- 
ceptualization of sampling variability described at the start of this section in 
which the entire super-population was randomly assigned to treatment. That 
is, randomization followed by random sampling is equivalent to random sam- 
pling followed by randomization. 

In many cases we are not interested in the first target. To see why, consider 
a study that compares the effect of two first-line treatments on the mortality 
of cancer patients. After the study ends, we may determine that it is better 
to initiate one of the two treatments, but this information is now irrelevant 
to the actual study participants. The purpose of the study was not to guide 
the choice of treatment for patients in the study but rather for a group of 
individuals similar to—but larger than—the studied sample. Heretofore we 
have assumed that there is a larger group—the super-population—from which 
the study participants were randomly sampled. We now turn our attention to 
the concept of the super-population. 


10.3 The myth of the super-population 


As discussed in Chapter 1, there are two sources of randomness: sampling 
variability and nondeterministic counterfactuals. Below we discuss both. 

Consider our estimate $\widehat{\Pr}[Y = 1 \mid A = 1] = \hat p = 7/13$ of the super-population 
risk $\Pr[Y = 1 \mid A = 1] = p$. Nearly all investigators would report a 
binomial confidence interval 

$$\hat p \pm 1.96 \sqrt{\frac{\hat p (1 - \hat p)}{n}} = \frac{7}{13} \pm 1.96 \sqrt{\frac{(7/13)(6/13)}{13}}$$ 

for the probability p. If asked why these intervals, they would say it is to incorporate 
the uncertainty due to random variability. But these intervals are valid only 
if $\hat p$ has a binomial sampling distribution. So we must ask when would that 
happen. In fact there are two scenarios under which $\hat p$ has a binomial sampling 
distribution. (Robins (1988) discussed these two scenarios in more detail.) 


• Scenario 1. The study population is sampled at random from an 
essentially infinite super-population, sometimes referred to as the source 
or target population, and our estimand is the proportion p = Pr[Y = 
1|A = 1] of treated individuals who developed the outcome in the 
super-population. It is then mathematically true that, in repeated random 
samples of size 13 from the treated individuals in the super-population, 
the number of individuals who develop the outcome among the 13 is a 
binomial random variable with success probability Pr[Y = 1|A = 1]. As 
a result, the 95% Wald confidence interval calculated in the previous 
section is asymptotically calibrated for Pr[Y = 1|A = 1]. This is the model 
we have considered so far. (The term i.i.d. used in Technical Point 10.1 
means that our data were a random sample of size n from a 
super-population.) 


• Scenario 2. The study population is not sampled from any super-population. 
Rather (i) each individual i among the 13 treated individuals has an 
individual nondeterministic (stochastic) counterfactual probability $p_i^{a=1}$, (ii) 
the observed outcome $Y_i = Y_i^{a=1}$ for subject i occurs with probability 
$p_i^{a=1}$, and (iii) $p_i^{a=1}$ takes the same value, say p, for each of the 13 
treated individuals. Then the number of individuals who develop the 
outcome among the 13 treated is a binomial random variable with 
success probability p. As a result, the 95% confidence interval calculated in 
the previous section is asymptotically calibrated for p. 


Scenario 1 assumes a hypothetical super-population. Scenario 2 does not. 
However, Scenario 2 is untenable because the probability $p_i^{a=1}$ of developing 
the outcome when treated will almost certainly vary among the 13 treated 
individuals due to between-individual differences in risk. For example we would 
expect the probability of death $p_i^{a=1}$ to have some dependence on an 
individual’s genetic make-up. If the $p_i^{a=1}$ are nonconstant then the estimand of 
interest in the actual study population would generally be the average, say p, of 
the 13 $p_i^{a=1}$. But in that case the number of treated who develop the outcome 
is not a binomial random variable with success probability p, and the 95% 
confidence interval for p calculated in the previous section is not asymptotically 
calibrated but conservative. 
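
Why the interval is conservative rather than anti-conservative can be seen by comparing variances. The sketch below assumes independent outcomes and a hypothetical set of 13 individual probabilities whose average is 7/13; the values and function names are illustrative only. With heterogeneous probabilities, the variance of the number of events is the sum of the individual Bernoulli variances, which never exceeds the binomial variance computed from the average probability.

```python
def heterogeneous_count_variance(probs):
    """Variance of the number of events when individual i has probability probs[i]
    and outcomes are independent (an assumption of this sketch)."""
    return sum(p * (1 - p) for p in probs)

def binomial_count_variance(probs):
    """Variance a binomial model would assign, using the average probability."""
    n = len(probs)
    p_bar = sum(probs) / n
    return n * p_bar * (1 - p_bar)

# Hypothetical individual risks for the 13 treated individuals; they average 7/13
probs = [0.25] * 4 + [0.5] * 5 + [0.875] * 4

true_var = heterogeneous_count_variance(probs)   # 2.4375
binom_var = binomial_count_variance(probs)       # 42/13, about 3.23

# The gap equals the sum of squared deviations of the risks from their average
p_bar = sum(probs) / len(probs)
gap = sum((p - p_bar) ** 2 for p in probs)

print(true_var, binom_var, gap)
```

Because the binomial formula overstates the variance whenever the individual risks differ, an interval built from it is wider than needed, which is the sense in which it is conservative.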

Therefore, any investigator who reports a binomial confidence interval for 
Pr[Y = 1|A = a], and who acknowledges that there exists between-individual 
variation in risk, must be implicitly assuming Scenario 1: the study individuals 
were sampled from a near-infinite super-population and all inferences are 
concerned with quantities from that super-population. Under Scenario 1, the 
number with the outcome among the 13 treated is a binomial variable regard- 
less of whether the underlying counterfactual is deterministic or stochastic. 

An advantage of working under the hypothetical super-population scenario 
is that nothing hinges on whether the world is deterministic or nondetermin- 
istic. On the other hand, the super-population is generally a fiction; in most 
studies individuals are not randomly sampled from any near-infinite popula- 
tion. Why then has the myth of the super-population endured? One reason is 
that it leads to simple statistical methods. 

A second reason has to do with generalization. As we mentioned in the 
previous section, investigators generally wish to generalize their findings about 
treatment effects from the study population (e.g., the 20 individuals in our 
heart transplant study) to some large target population (e.g., all immortals in 
the Greek pantheon). The simplest way of doing so is to assume the study 
population is a random sample from a large population of individuals who 
are potential recipients of treatment. Since this is a fiction, a 95% confi- 
dence interval computed under Scenario 1 should be interpreted as covering 
the super-population parameter had, often contrary to fact, the study individ- 
uals been sampled randomly from a near infinite super-population. In other 
words, confidence intervals obtained under Scenario 1 should be viewed as 
what-if statements. 

It follows from the above that an investigator might not want to entertain 
Scenario 1 if the size of the pool of potential recipients is not much larger 
than the size of the study population, or if the target population of potential 
recipients is believed to differ from the study population to an extent that 
cannot be accounted for by sampling variability. Here we will accept that 
individuals were randomly sampled from a super-population, and explore the 
consequences of random variability for causal inference in that context. We 
first explore this question in a simple randomized experiment. 


10.4 The conditionality “principle” 


Table 10.1 

        Y = 1   Y = 0 
A = 1     24      96 
A = 0     42      78 

Table 10.2 

L = 1   Y = 1   Y = 0 
A = 1      4      76 
A = 0      2      38 

L = 0   Y = 1   Y = 0 
A = 1     20      20 
A = 0     40      40 

(The estimated variance of the unadjusted estimator is 
$\frac{(24/120)(96/120)}{120} + \frac{(42/120)(78/120)}{120} = \frac{31}{9600}$. The Wald 
95% confidence interval is then 
$-0.15 \pm \left(\frac{31}{9600}\right)^{1/2} \times 1.96 = (-0.26, -0.04)$.) 



Table 10.1 summarizes the data from a randomized trial to estimate the average 
causal effect of treatment A (1: yes, 0: no) on the 1-year risk of death Y (1: 
yes, 0: no). The experiment included 240 individuals, 120 in each treatment 
group. The associational risk difference is Pr[Y = 1|A = 1] − Pr[Y = 1|A = 0] 
= 24/120 − 42/120 = −0.15. If the experiment had been conducted in a 
super-population of near-infinite size, the treated and the untreated would be 
exchangeable, i.e., $Y^a \perp\!\!\!\perp A$, and the associational risk difference would equal 
the causal risk difference $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]$. Suppose the study 
investigators computed a 95% confidence interval (−0.26, −0.04) around the 
point estimate —0.15 and published an article in which they concluded that 
treatment was beneficial because it reduced the risk of death by 15 percentage 
points. 

However, the study population had only 240 individuals and it is therefore 
likely that, due to chance, the treated and the untreated are not perfectly 
exchangeable. Random assignment of treatment does not guarantee exact ex- 
changeability for the sample consisting of the 240 individuals in the trial; it only 
guarantees that any departures from exchangeability are due to random vari- 
ability rather than to a systematic bias. In fact, one can view the uncertainty 
resulting from our ignorance of the chance correlation between unmeasured 
baseline risk factors and the treatment A in the study sample as contributing 
to the length 0.22 of the confidence interval. 

A few months later the investigators learn that information on a third 
variable, cigarette smoking L (1: yes, 0: no), had also been collected and 
decide to take a look at it. The study data, stratified by L, is shown in Table 
10.2. Unexpectedly, the investigators find that the probability of receiving 
treatment for smokers (80/120) is twice that for nonsmokers (40/120), which 
suggests that the treated and the untreated are not exchangeable and thus 
that adjustment for smoking is necessary. When the investigators adjust via 
stratification, the associational risk difference in smokers, Pr[Y = 1|A = 1, L = 
1] — Pr[Y = 1|A = 0, L = 1], is equal to 0. The associational risk difference in 
nonsmokers, Pr[Y = 1|A = 1, L = 0] — Pr[Y = 1|A = 0, L = 0], is also equal 
to 0. Treatment has no effect in both smokers and nonsmokers, even though 
the marginal risk difference —0.15 suggested a net beneficial effect in the study 
Technical Point 10.2 


A formal statement of the conditionality principle. The likelihood for the observed data has three factors: the 
density of Y given A and L, the density of A given L, and the marginal density of L. Consider a simple example with 
one dichotomous L, exchangeability $Y^a \perp\!\!\!\perp A \mid L$, the stratum-specific risk difference 
$sRD = \Pr(Y = 1 \mid L = l, A = 1) - \Pr(Y = 1 \mid L = l, A = 0)$ known to be constant across strata of L, 
and in which the parameter of interest is the stratum-specific causal risk difference. Then the likelihood of the data is 

$$\prod_{i=1}^{n} f(Y_i \mid L_i, A_i; sRD, p_0) \times f(A_i \mid L_i; \alpha) \times f(L_i; \rho)$$ 

where $p_0 = (p_{0,1}, p_{0,2})$ with $p_{0,l} = \Pr(Y = 1 \mid L = l, A = 0)$, $\alpha$, and $\rho$ are nuisance parameters associated with the 
conditional density of Y given A and L, the conditional density of A given L, and the marginal density of L, respectively. 
See, for example, Casella and Berger (2002). 

The data on A and L are said to be exactly ancillary for the parameter of interest when, as in this case, the 
distribution of the data conditional on these variables depends on the parameter of interest, but the joint density of A 
and L does not share parameters with $f(Y_i \mid L_i, A_i; sRD, p_0)$. The conditionality principle states that one should always 
perform inference on the parameter of interest conditional on any ancillary statistics. Thus one should condition on the 
ancillary statistic $\{A_i, L_i;\ i = 1, \ldots, n\}$. Analogously, if the risk ratio (rather than the risk difference) were known to be 
constant across strata of L, $\{A_i, L_i;\ i = 1, \ldots, n\}$ remains ancillary for the risk ratio. 





population. (The estimated variance of the adjusted estimator is described in 
Technical Point 10.5. The Wald 95% confidence interval is then 
(−0.076, 0.076).) 

These new findings are disturbing to the investigators. Either someone did 
not assign the treatment at random (malfeasance) or randomization did not 
result in approximate exchangeability (very very bad luck). A debate ensues 
among the investigators. Should they retract their article and correct the 
results? They all agree that the answer to this question would be affirmative 
if the problem were due to malfeasance. If that were the case, there would 
be confounding by smoking and the effect estimate should be adjusted for 
smoking. But they all agree that malfeasance is impossible given the study’s 
quality assurance procedures. It is therefore clear that the association between 
smoking and treatment is entirely due to bad luck. Should they still retract 
their article and correct the results? 
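
The numbers behind the investigators' dilemma can be reproduced directly from the counts in Table 10.2. The sketch below is only a bookkeeping aid with illustrative variable names: it collapses the stratified table to recover the unadjusted analysis, and then computes the probability of treatment and the associational risk difference within each level of smoking.

```python
import math

# Table 10.2 counts: table[l][a] = (deaths, survivors)
table = {
    1: {1: (4, 76), 0: (2, 38)},     # smokers (L = 1)
    0: {1: (20, 20), 0: (40, 40)},   # nonsmokers (L = 0)
}

def risk(deaths, survivors):
    return deaths / (deaths + survivors)

# Collapsing over L recovers Table 10.1
marginal = {a: tuple(sum(table[l][a][j] for l in table) for j in (0, 1))
            for a in (0, 1)}

# Unadjusted risk difference with its Wald 95% confidence interval
p1, p0 = risk(*marginal[1]), risk(*marginal[0])
n1, n0 = sum(marginal[1]), sum(marginal[0])
rd_unadjusted = p1 - p0
var_unadjusted = p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0
half_width = 1.96 * math.sqrt(var_unadjusted)
ci_unadjusted = (rd_unadjusted - half_width, rd_unadjusted + half_width)

# Probability of receiving treatment within each level of L
pr_treated = {l: sum(table[l][1]) / (sum(table[l][1]) + sum(table[l][0]))
              for l in table}

# Stratum-specific associational risk differences
rd_stratum = {l: risk(*table[l][1]) - risk(*table[l][0]) for l in table}

print(rd_unadjusted, ci_unadjusted)   # about -0.15 and (-0.26, -0.04)
print(pr_treated)                     # 2/3 for smokers, 1/3 for nonsmokers
print(rd_stratum)                     # 0 in both strata
```

The marginal risk difference of −0.15 therefore coexists with null risk differences in both strata, because smokers, who have the lower risk of death in these data, were twice as likely to be treated.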


One investigator says that they should not retract the article. His argument 
goes as follows: “Okay, randomization went wrong for smoking, but why should 
we privilege the adjusted over the unadjusted estimator? It is likely that 
imbalances on other unmeasured factors U cancelled out the effect of the chance 
imbalance on L, so that the unadjusted estimator is still the closer to the true 
value in the super-population.” A second investigator says that they should 
retract the article and report the adjusted null result. Her argument goes as 
follows: “We should adjust for L because the strong association between L and 
A introduces confounding in our effect estimate. Within levels of L, we have 
mini randomized trials and the confidence intervals around the corresponding 
point estimates will reflect the uncertainty due to the possible U-A associations 
conditional on L.” 

To determine which investigator is correct, here are the facts of the matter. 
Suppose, for simplicity, the true causal risk difference is constant across strata 
of L, and suppose we could run the randomized experiment trillions of times. 
We then select only (i.e., condition on) those runs in which smoking L and 
treatment A are as strongly positively associated as in the observed data. We 
Technical Point 10.3 


Approximate ancillarity. Suppose that the stratum-specific risk difference ($sRD_l$) is known to vary over strata of L. 
Under our usual identifiability assumptions, the causal risk difference in the population is identified by the standardized 
risk difference 

$$RD_{std} = \sum_l \left[ \Pr(Y = 1 \mid L = l, A = 1; \nu) - \Pr(Y = 1 \mid L = l, A = 0; \nu) \right] f(l; \rho)$$ 

which depends on the parameters $\nu = \{sRD_l, p_{0,l};\ l = 0, 1\}$ and $\rho$ (see Technical Point 10.2). In unconditionally 
randomized experiments, $RD_{std}$ equals the associational RD, $\Pr(Y = 1 \mid A = 1) - \Pr(Y = 1 \mid A = 0)$, because $A \perp\!\!\!\perp L$ 
in the super-population. Due to the dependence of $RD_{std}$ on $\rho$, $\{A_i, L_i;\ i = 1, \ldots, n\}$ is no longer exactly ancillary and 
in fact no exact ancillary exists. 

Consider the statistic $\tilde S = \widehat{OR}_{AL} - OR_{AL}$ where 

$$OR_{AL} = OR_{AL}(\alpha) = \frac{\Pr(A = 1 \mid L = 1; \alpha) \Pr(A = 0 \mid L = 0; \alpha)}{\Pr(A = 1 \mid L = 0; \alpha) \Pr(A = 0 \mid L = 1; \alpha)}$$ 

is the A-L odds ratio in the super-population, and $\widehat{OR}_{AL}$ is $OR_{AL}$ but with the population proportions $\Pr(A = a \mid L = l; \alpha)$ 
replaced by the empirical sample proportions $\widehat{\Pr}(A = a \mid L = l)$. $\tilde S$ is asymptotically normal with mean 0. Let 
$\hat S = \tilde S / \widehat{se}(\tilde S)$, where $\widehat{se}(\tilde S)$ is an estimate of the standard error of $\tilde S$. The distribution of $\hat S$ converges to a standard normal 
distribution in large samples, so that $\hat S$ quantifies the A-L association in the data on a standardized scale. For example, 
if $\hat S = 2$, then $\tilde S$ is two standard deviations above its (asymptotic) expected value of 0. 

When the true value of $OR_{AL}$ is known, $\hat S$ is referred to as an approximate (or large sample) ancillary statistic. 
To see why, consider a randomized experiment with $OR_{AL} = 1$. Then $\hat S$, like an exact ancillary statistic, i) can be 
computed from the data (i.e., $\hat S = (\widehat{OR}_{AL} - 1) / \widehat{se}(\tilde S)$), ii) $\hat S = \hat S(\alpha)$ depends on a parameter $\alpha$ that does not 
occur in the estimand of interest, iii) the likelihood factors into a term $f(A \mid L; \alpha)$ that depends only on $\alpha$ and a term 
$f(Y \mid L, A; \nu) f(L; \rho)$ that does not depend on $\alpha$, and iv) conditional on $\hat S$, the adjusted estimate of $RD_{std}$ is unbiased, 
while the unadjusted estimate of $RD_{std}$ is biased (Technical Point 10.4 defines and compares adjusted and unadjusted 
estimators). Any other statistic that quantifies the A-L association in the data, e.g., 
$\frac{\widehat{\Pr}(A = 1 \mid L = 1)}{\widehat{\Pr}(A = 1 \mid L = 0)} - 1$, can be used in place of $\tilde S$. 

Now consider a continuity principle wherein inferences about an estimand should not change discontinuously in 
response to an arbitrarily small known change in the data generating distribution (Buehler 1982). If one accepts both 
the conditionality and continuity principles, then one should condition on an approximate ancillary statistic. For example, 
when $OR_{AL} = 1$ is known, the continuity principle would be violated if, following the conditionality principle, we treated 
the unadjusted estimate of $RD_{std}$ as biased when $sRD_l$ was known to be a constant, but treated it as unbiased when 
the $sRD_l$ were almost constant. We will say that a researcher who always conditions on both exact and approximate 
ancillaries follows the extended conditionality principle. 





would find that, within each level of L, the fraction of these runs in which 
any given risk factor U for Y was positively associated with A essentially 
equals the number of runs in which it was negatively associated. (This is true 
even if U and L are highly correlated in both the super-population and in 
the study data.) As a consequence, the adjusted estimate of the treatment 
effect is unbiased but the unadjusted estimate is greatly biased when averaged 
over these runs. Unconditionally—over all the runs of the experiment—both 
the unadjusted and adjusted estimates are unbiased but the variance of the 
adjusted estimate is smaller than that of the unadjusted estimate. That is, the 
adjusted estimator is both conditionally unbiased and unconditionally more 
efficient. Hence either from the conditional or unconditional point of view, the 
Wald interval centered on the adjusted estimator is the better analysis and the 
article needs to be retracted. The second investigator is correct. (The 
unconditional efficiency of the adjusted estimator results from the adjusted 
estimator being the maximum likelihood estimator (MLE) of the risk difference 
when data on L are available.) 

The idea that one should condition on the observed L-A association is an 
example of what is referred to in the statistical literature as the conditionality 
Technical Point 10.4 


Comparison between adjusted and unadjusted estimators. The adjusted estimator of $RD_{std}$ in Technical Point 
10.3 is the maximum likelihood estimator $\widehat{RD}_{MLE}$, which replaces the population proportions in the $RD_{std}$ by their 
sample proportions. The unadjusted estimator of $RD_{std}$ is $\widehat{RD}_{UN} = \widehat{\Pr}(Y = 1 \mid A = 1) - \widehat{\Pr}(Y = 1 \mid A = 0)$. 
Unconditionally, both $\widehat{RD}_{MLE}$ and $\widehat{RD}_{UN}$ are asymptotically normal and unbiased for $RD_{std}$ with asymptotic variances 
$aVar(\widehat{RD}_{MLE})$ and $aVar(\widehat{RD}_{UN})$. 

In the text we stated that $\widehat{RD}_{UN}$ is both unconditionally inefficient and conditionally biased. We now explain that 
both properties are logically equivalent. Robins and Morgenstern (1987) prove that $\widehat{RD}_{MLE}$ has the same asymptotic 
distribution conditional on the approximate ancillary $\hat S$ as it does unconditionally, which implies 
$aVar(\widehat{RD}_{MLE}) = aVar(\widehat{RD}_{MLE} \mid \hat S)$. They also show that $aVar(\widehat{RD}_{MLE})$ equals 
$aVar(\widehat{RD}_{UN}) - [aCov(\hat S, \widehat{RD}_{UN})]^2$. Hence 
$\widehat{RD}_{UN}$ is unconditionally inefficient if and only if $aCov(\hat S, \widehat{RD}_{UN}) \neq 0$, i.e., $\hat S$ and $\widehat{RD}_{UN}$ are correlated 
unconditionally. Further, the conditional asymptotic bias $aE[\widehat{RD}_{UN} \mid \hat S] - RD_{std}$ is shown to equal $aCov(\hat S, \widehat{RD}_{UN})\, \hat S$. Hence, 
$\widehat{RD}_{UN}$ is conditionally biased if and only if it is unconditionally inefficient. 

It can be shown that $aCov(\hat S, \widehat{RD}_{UN}) = 0$ if and only if $L \perp\!\!\!\perp Y \mid A$. Therefore, when data on a measured risk 
factor for Y are available, $\widehat{RD}_{MLE}$ is preferred over $\widehat{RD}_{UN}$. 


principle. In statistics, the observed L-A association is said to be an ancil- 
lary statistic for the causal risk difference. The conditionality principle states 
that inference on a parameter should be performed conditional on ancillary 
statistics (see Technical Points 10.2 and 10.3 for details). The discussion in 
the preceding paragraph then implies that many researchers intuitively follow 
the conditionality principle when they consider an estimator to be biased if it 
cannot center a valid Wald confidence interval conditional on any ancillary sta- 
tistics. For such researchers, our previous definition of bias was not sufficiently 
restrictive. They would say that an estimator is unbiased if and only if it can 
center a valid Wald interval conditional on ancillary statistics. Technical Point 
10.5 argues that most researchers implicitly follow the conditionality principle. 

When confronted with the frequentist argument that “Adjustment for L 
is unnecessary because unconditionally—over all the runs of the experiment— 
the unadjusted estimate is unbiased,” investigators that intuitively apply the 
conditionality principle would aptly respond “Why should the various L-A 
associations in other hypothetical studies affect what I do in my study? In my 
study L acts as a confounder and adjustment is needed to eliminate bias.” This 
is a convincing argument for both randomized experiments and observational 
studies when, as above, the number of measured confounders is not large. 
However, when the number of measured confounders is large, strictly following 
the conditionality principle is no longer a wise strategy. 
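
The thought experiment of rerunning the trial many times and keeping only the runs with a strong L-A association can be sketched as a simulation. All design values below are assumptions made for illustration: 240 individuals per trial, smoking and treatment each assigned with probability 0.5, risks of death of 0.05 for smokers and 0.5 for nonsmokers regardless of treatment (so the true risk difference is 0), 4000 runs rather than trillions, and a cutoff of 0.10 on the sample difference Pr[A = 1|L = 1] − Pr[A = 1|L = 0] to define a strong positive association. The adjusted estimator used here is the risk difference standardized to the sample distribution of L.

```python
import random

RISK_BY_L = (0.5, 0.05)   # hypothetical risk of death for L = 0 and L = 1, same in both arms

def simulate_trial(rng, n=240):
    """Return cells[l][a] = [individuals, deaths] for one marginally randomized trial."""
    cells = [[[0, 0], [0, 0]], [[0, 0], [0, 0]]]
    for _ in range(n):
        l = int(rng.random() < 0.5)            # smoking, independent of treatment
        a = int(rng.random() < 0.5)            # treatment assigned by a fair coin
        y = int(rng.random() < RISK_BY_L[l])   # outcome does not depend on treatment
        cells[l][a][0] += 1
        cells[l][a][1] += y
    return cells

def al_imbalance(cells):
    """Sample Pr(A=1 | L=1) - Pr(A=1 | L=0)."""
    pr = [cells[l][1][0] / (cells[l][0][0] + cells[l][1][0]) for l in (0, 1)]
    return pr[1] - pr[0]

def unadjusted_rd(cells):
    n = [cells[0][a][0] + cells[1][a][0] for a in (0, 1)]
    d = [cells[0][a][1] + cells[1][a][1] for a in (0, 1)]
    return d[1] / n[1] - d[0] / n[0]

def adjusted_rd(cells):
    """Risk difference standardized to the sample distribution of L."""
    total = sum(cells[l][a][0] for l in (0, 1) for a in (0, 1))
    rd = 0.0
    for l in (0, 1):
        n_l = cells[l][0][0] + cells[l][1][0]
        rd_l = cells[l][1][1] / cells[l][1][0] - cells[l][0][1] / cells[l][0][0]
        rd += rd_l * n_l / total
    return rd

def mean(values):
    return sum(values) / len(values)

rng = random.Random(2024)
all_runs, selected = [], []
for _ in range(4000):
    cells = simulate_trial(rng)
    if any(cells[l][a][0] == 0 for l in (0, 1) for a in (0, 1)):
        continue   # an empty cell is practically impossible with n = 240
    pair = (unadjusted_rd(cells), adjusted_rd(cells))
    all_runs.append(pair)
    if al_imbalance(cells) >= 0.10:   # keep runs with a strong positive L-A association
        selected.append(pair)

mean_unadj_all = mean([u for u, _ in all_runs])
mean_adj_all = mean([a for _, a in all_runs])
mean_unadj_sel = mean([u for u, _ in selected])
mean_adj_sel = mean([a for _, a in selected])

print(len(selected), mean_unadj_all, mean_adj_all, mean_unadj_sel, mean_adj_sel)
```

Over all runs both estimators average close to the true value of 0. Among the selected runs the unadjusted estimate is, on average, clearly negative, while the adjusted estimate remains close to 0.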


10.5 The curse of dimensionality 


The derivations in the previous sections are based on an asymptotic theory 
that assumed the number of strata of L was small compared with the sample 
size. In this section, we study the cases in which the number of strata of a 
Technical Point 10.5 


Most researchers intuitively follow the extended conditionality principle. Consider again the randomized trial data 
in Table 10.2. Assuming without loss of generality that the sRD is constant over the strata of a dichotomous L, the 
estimated variance of the MLE of sRD is $\hat V_0 \hat V_1 / (\hat V_0 + \hat V_1)$, where $\hat V_l$ is the estimated variance of $\widehat{RD}_l$. 

Two possible choices for $\hat V_1$ are 
$\hat V_1^{obs} = \frac{(4/80)(76/80)}{80} + \frac{(2/40)(38/40)}{40} = 1.78 \times 10^{-3}$ and 
$\hat V_1^{exp} = \frac{(4/80)(76/80)}{60} + \frac{(2/40)(38/40)}{60} = 1.58 \times 10^{-3}$, that 
differ only in that $\hat V_1^{obs}$ divides by the observed number of individuals in stratum L = 1 with A = 1 and A = 0 (80 and 
40, respectively) while $\hat V_1^{exp}$ divides by the expected number of subjects (60) given that $A \perp\!\!\!\perp L$. Mathematically, $\hat V_1^{obs}$ is 
the variance estimator based on the observed information and $\hat V_1^{exp}$ is the estimator based on the expected information. 

In our experience, nearly all researchers would choose $\hat V_1^{obs}$ over $\hat V_1^{exp}$ as the appropriate variance estimator. Results 
of Efron and Hinkley (1982) and Robins and Morgenstern (1987) imply that such researchers are implicitly conditioning 
on an approximate ancillary $\hat S$ and thus, whether aware of this fact or not, are following the extended conditionality 
principle. Specifically, these authors proved that the variance of $\widehat{RD}_l$, and thus of the MLE, conditioned on an 
approximate ancillary $\hat S$ differs from the unconditional variance by order $n^{-3/2}$. (As noted in Technical Point 10.4, 
the conditional and unconditional asymptotic variance of an MLE are equal, as equality of asymptotic variances implies 
equality only up to order $n^{-1}$.) Further, they showed that the variance estimator based on the observed information 
differs from the conditional variance by less than order $n^{-3/2}$, while an estimator based on the expected information 
differs from the unconditional variance by less than $n^{-3/2}$. Thus, a preference for $\hat V_1^{obs}$ over $\hat V_1^{exp}$ implies a preference 
for conditional over unconditional inference. 


vector L can be very large, even much larger than the sample size. 
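As a quick arithmetic check of the two variance estimators discussed in Technical Point 10.5, the sketch below computes both for stratum L = 1. It assumes a risk of 0.05 in each treatment arm of that stratum. That value is not stated in this section; it is an assumption that happens to reproduce the two reported variances. The function name is illustrative.

```python
def var_risk_difference(p1, n1, p0, n0):
    """Estimated variance of a difference between two independent proportions."""
    return p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0

# Assumed risks in stratum L = 1 (an assumption, not given in this section).
p_treated, p_untreated = 0.05, 0.05

# Observed information: divide by the observed arm sizes (80 treated, 40 untreated).
v_obs = var_risk_difference(p_treated, 80, p_untreated, 40)

# Expected information: divide by the arm sizes expected when A and L are
# independent (60 and 60).
v_exp = var_risk_difference(p_treated, 60, p_untreated, 60)

print(f"V_obs = {v_obs:.2e}, V_exp = {v_exp:.2e}")
```

The observed-information version is larger here because the imbalance (80 versus 40) leaves fewer individuals in the smaller arm than randomization would produce on average.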

Suppose the investigators had measured 100 pre-treatment binary variables
rather than only one. Then the pre-treatment variable L formed by combining
the 100 variables, $L = (L_1, \ldots, L_{100})$, has $2^{100}$ strata. When, as in this case,
there are many possible combinations of values of the pre-treatment variables,
we say that the data are of high dimensionality. For simplicity, suppose that
there is no additive effect modification by L, i.e., the super-population risk
difference Pr[Y = 1|A = 1, L = l] − Pr[Y = 1|A = 0, L = l] is constant across
the $2^{100}$ strata. In particular, suppose that the constant stratum-specific risk
difference is 0.

The investigators debate again whether to retract the article and report 
their estimate of the stratified risk difference. They have by now agreed that 
they should follow the conditionality principle because the unadjusted risk 
difference −0.15 is conditionally biased. However, they notice that, when there
are $2^{100}$ strata, a 95% confidence interval for the risk difference based on the
adjusted estimator is much wider than that based on the unadjusted estimator. 
This is exactly the opposite of what was found when L had only two strata. 
In fact, the 95% confidence interval based on the adjusted estimator may be 
so wide as to be completely uninformative. 

To see why, note that, because $2^{100}$ is much larger than the number of
individuals (240), there will at most be only a few strata of L that will contain
both a treated and an untreated individual. Suppose only one of the $2^{100}$ strata
contains a single treated individual and a single untreated individual, and no
other stratum contains both a treated and untreated individual. Then the
95% confidence interval for the common risk difference based on the adjusted
estimator is (−1, 1), and therefore completely uninformative, because in the
single stratum with both a treated and an untreated individual, the empirical
risk difference could be −1, 0, or 1 depending on the value of Y for each
individual. In contrast, the 95% confidence interval for the common risk difference
based on the unadjusted estimator remains (−0.26, −0.04) as above because
its width is unaffected by the fact that more covariates were measured. These
results reflect the fact that the adjusted estimator is only guaranteed to be
more efficient than the unadjusted estimator when the ratio of number of
individuals to the number of unknown parameters is large (a frequently used rule
of thumb is a minimum ratio of 10, though the minimum ratio depends on the
characteristics of the data).

Technical Point 10.6

Can the curse of dimensionality be reversed? In high-dimensional settings with many strata of L, informative
conditional inference for the common risk difference given the exact ancillary statistic $\{A_i, L_i; i = 1, \ldots, n\}$ is not possible
regardless of the estimator used. This is not true for unconditional inference in marginally randomized experiments. For
example, the unconditional statistical behavior of the unadjusted estimator $\widehat{RD}_{UN}$ is unaffected by the dimension of
L. In particular, it remains unbiased with the width of the associated Wald 95% confidence interval proportional to
$1/n^{1/2}$. Because $\widehat{RD}_{UN}$ relies on prior information not used by the MLE, it is an unbiased estimator of the common
risk difference only if it is known that $A \perp\!\!\!\perp L$ in the super-population.

However, even unconditionally, the confidence intervals associated with the MLE, i.e., the adjusted estimator,
remain uninformative. This raises the question of whether data on L can be used to construct an estimator that is also
unconditionally unbiased but that is more efficient than the unadjusted estimator. In Chapter 18 we show that this is
indeed possible.
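To see how quickly the strata empty out as covariates are added, the following sketch simulates 240 individuals and counts the strata that contain at least one treated and one untreated individual. It is a simulation under stated assumptions, not the book's analysis: covariates are independent fair coin flips, treatment is assigned with probability 0.5, and the function name and seed are arbitrary.

```python
import random

def strata_with_both_arms(n, k, seed=0):
    """Count strata of k binary covariates that hold both a treated and an untreated individual."""
    rng = random.Random(seed)
    arms_by_stratum = {}
    for _ in range(n):
        stratum = tuple(rng.randint(0, 1) for _ in range(k))  # one of 2**k strata
        treated = rng.randint(0, 1)                           # marginal randomization
        arms_by_stratum.setdefault(stratum, set()).add(treated)
    return sum(1 for arms in arms_by_stratum.values() if len(arms) == 2)

for k in (1, 5, 10, 100):
    print(f"{k:>3} covariates: {strata_with_both_arms(n=240, k=k)} usable strata")
```

With a single covariate both strata contribute to the adjusted estimator. With 100 covariates essentially every individual sits alone in a stratum, so no stratum-specific risk difference can be computed.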

What should the investigators do? By trying to do the right thing
(following the conditionality principle) in the simple setting with one
dichotomous variable, they put themselves in a corner for the high-dimensional
setting. This is the curse of dimensionality: conditional on all 100 covariates
the marginal estimator is still biased, but now the conditional estimator is
uninformative. This shows that, just because conditionality is compelling in
simple examples, it should not be raised to a principle since it cannot be
carried through for high-dimensional models. Though we have discussed this issue
in the context of a randomized experiment, our discussion applies equally to
observational studies. See Technical Point 10.6.

Robins and Ritov (1997) provide a technical description of the curse of
dimensionality.

Finding a solution to the curse of dimensionality is a difficult problem and 
an active area of research. In Chapter 18 we review this research and offer some 
practical guidance. Chapters 11 through 17 provide necessary background 
information on the use of models for causal inference. 
Part II


Causal inference with models 


Chapter 11 
WHY MODEL? 


Do not worry. No more chapter introductions around the effect of your looking up on other people’s looking up. 
We squeezed that example well beyond what seemed possible. In Part II of this book, most examples involve real 
data. The data sets can be downloaded from the book’s web site. 

Part I was mostly conceptual. Calculations were kept to a minimum, and could be carried out by hand. In 
contrast, the material described in Part II requires the use of computers to fit regression models, such as linear 
and logistic models. Because this book cannot provide a detailed introduction to regression techniques, we assume 
that readers have a basic understanding and working knowledge of these commonly used models. Our web site 
provides links to computer code in R, SAS, Stata, and Python to replicate the analyses described in the text. The 
CODE margin notes specify the portion of the code that is relevant to the analysis described in the text. 

This chapter describes the differences between the nonparametric estimators used in Part I and the parametric 
(model-based) estimators used in Part II. It also reviews the concept of smoothing and, briefly, the bias-variance 
trade-off involved in any modeling decision. The chapter motivates the need for models in data analysis, regardless 
of whether the analytic goal is causal inference or, say, prediction. We will take a break from causal considerations 
until the next chapter. Please bear in mind that the statistical literature on modeling is vast; this chapter can 
only highlight some of the key issues. 


11.1 Data cannot speak for themselves 


Consider a study population of 16 individuals infected with the human im- 
munodeficiency virus (HIV). Unlike in Part I of this book, we will not view 
these individuals as representatives of 1 billion individuals identical to them. 
Rather, these are just 16 individuals randomly sampled from a large, possibly 
hypothetical super-population: the target population. 

At the start of the study each individual receives a certain level of a treat- 
ment A (antiretroviral therapy), which is maintained during the study. At the 
end of the study, a continuous outcome Y (CD4 cell count, in cells/mm³) is
measured in all individuals. We wish to consistently estimate the mean of Y 
among individuals with treatment level A = a in the population from which the 
16 individuals were randomly sampled. That is, the estimand is the unknown 
population parameter E[Y|A = a]. 

As defined in Chapter 10, an estimator $\hat{E}[Y|A = a]$ of E[Y|A = a] is some
function of the data that is used to estimate the unknown population parameter.
Informally, a consistent estimator $\hat{E}[Y|A = a]$ meets the requirement that
“the larger the sample size, the closer the estimate to the population value
E[Y|A = a].” Two examples of possible estimators $\hat{E}[Y|A = a]$ are (i) the
sample average of Y among those receiving A = a, and (ii) the value of the
first observation in the dataset that happens to have the value A = a. The
sample average of Y among those receiving A = a is a consistent estimator of
the population mean; the value of the first observation with A = a is not. In
practice we require all estimators to be consistent, and therefore we use the
sample average to estimate the population mean.

See Chapter 10 for a rigorous definition
of a consistent estimator.
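The contrast between the two candidate estimators can be illustrated by simulation. The sketch below assumes a hypothetical normal super-population with mean 100 and standard deviation 50 (both invented for illustration) and, for simplicity, treats every sampled individual as having the treatment level of interest. It reports how much each estimator varies across repeated samples as the sample size grows.

```python
import random
import statistics

def draw_sample(n, rng, mean=100.0, sd=50.0):
    """Draw n outcomes from a hypothetical normal super-population."""
    return [rng.gauss(mean, sd) for _ in range(n)]

def sample_average(y):
    return statistics.fmean(y)

def first_observation(y):
    return y[0]

def spread(estimator, n, reps=300, seed=0):
    """Standard deviation of an estimator across repeated samples of size n."""
    rng = random.Random(seed)
    estimates = [estimator(draw_sample(n, rng)) for _ in range(reps)]
    return statistics.stdev(estimates)

for n in (16, 1600):
    print(n, round(spread(sample_average, n), 2), round(spread(first_observation, n), 2))
```

The spread of the sample average shrinks as n increases, which is what consistency requires. The spread of the first observation stays near the population standard deviation however large the sample is.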

Suppose treatment A is a dichotomous variable with two possible values: no 
treatment (A = 0) and treatment (A = 1). Half of the individuals were treated
(A = 1). Figure 11.1 is a scatter plot that displays each of the 16 individuals 
as a dot. The height of the dot indicates the value of the individual’s outcome 
Y. The 8 treated individuals are placed along the column A = 1, and the 8 
untreated along the column A = 0. As defined in Chapter 10, an estimate of 
the mean of Y among individuals with level A = a in the population is the 
numerical result of applying the estimator—in our case, the sample average—to 
a particular data set. 

Our estimate of the population mean in the treated is the sample aver- 
age 146.25 for those with A = 1, and our estimate of the population mean 
in the untreated is the sample average 67.50 in those with A = 0. Under ex- 
changeability of the treated and the untreated, the difference 146.25 — 67.50 
would be interpreted as an estimate of the average causal effect of treatment 
A on the outcome Y in the target population. However, this chapter is not 
about making causal inferences. Our current goal is simply to motivate the 
need for models when trying to estimate population quantities like the mean 
E[Y|A = a], irrespective of whether the estimates do or do not have a causal
interpretation. 
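The treatment-specific sample averages can be computed in a few lines. The outcome values below are illustrative: they were chosen so that the two group averages match the 146.25 and 67.50 reported in the text, and they should not be read as the book's data set, which is available from its web site.

```python
from statistics import fmean

# Illustrative data: 8 treated (A = 1) followed by 8 untreated (A = 0).
A = [1] * 8 + [0] * 8
Y = [200, 150, 220, 110, 50, 180, 90, 170,
     170, 30, 70, 110, 80, 50, 10, 20]

def group_mean(a):
    """Sample average of Y among individuals with treatment level a."""
    return fmean([y for x, y in zip(A, Y) if x == a])

mean_treated = group_mean(1)
mean_untreated = group_mean(0)
print(mean_treated, mean_untreated, mean_treated - mean_untreated)
```

The difference between the two averages has a causal interpretation only under exchangeability; the computation itself is purely descriptive.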

Now suppose treatment A is a polytomous variable that can take 4 possible 
values: no treatment (A = 1), low-dose treatment (A = 2), medium-dose treat- 
ment (A = 3), and high-dose treatment (A = 4). A quarter of the individuals 
received each treatment level. Figure 11.2 displays the outcome value for the 
16 individuals in the study population. To estimate the population means in 
the 4 groups defined by treatment level, we compute the corresponding sample 
averages. The estimates are 70.0, 80.0, 117.5, and 195.0 for A = 1, A = 2, 
A = 3, and A = 4, respectively. 

Figures 11.1 and 11.2 depict examples of discrete (categorical) treatment 
variables with 2 and 4 categories, respectively. Because the number of study 
individuals is fixed at 16, the number of individuals per category decreases as 
the number of categories increases. The sample average in each category is still 
an exactly unbiased estimator of the corresponding population mean, but the 
probability that the sample average is close to the corresponding population 
mean decreases as the number of individuals in each category decreases. The 
length of the 95% confidence intervals (see Chapter 10) for the category-specific 
means will be greater for the data in Figure 11.2 than for the data in Figure 
11.1. 
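A small calculation makes the loss of precision concrete. The sketch assumes, purely for illustration, that the outcome has the same standard deviation (50) in every category, and uses the Wald interval length 2 × 1.96 × sd / √n.

```python
import math

def wald_interval_length(sd, n, z=1.96):
    """Length of a Wald 95% confidence interval for a mean."""
    return 2 * z * sd / math.sqrt(n)

sd = 50.0  # hypothetical outcome standard deviation, assumed equal across categories
for categories in (2, 4):
    n_per_category = 16 // categories
    length = wald_interval_length(sd, n_per_category)
    print(categories, "categories:", n_per_category, "per category, length", round(length, 1))
```

Halving the number of individuals per category multiplies the interval length by √2, whatever the common standard deviation is.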

Finally, suppose that treatment A is a variable representing the dose of 
treatment in mg/day, and that it takes integer values from 0 to 100 mg. Figure 
11.3 displays the outcome value for each of the 16 individuals. Because the 
number of possible values of treatment is much greater than the number of 
individuals in the study, there are many values of A that no individual received. 
For example, there are no individuals with treatment dose A = 90 in the study 
population. 

This creates a problem: how can we estimate the mean of the outcome Y 
among individuals with treatment level A = 90 in the target population? The 
estimator we used for the data in Figures 11.1 and 11.2—the treatment-specific 
sample average—is undefined for treatment levels for which there are zero in- 
dividuals in Figure 11.3. If treatment A were a truly continuous variable, then 
the sample average would be undefined for nearly all treatment levels. (A con- 
tinuous variable A can be viewed as a categorical variable with an uncountably 
infinite number of categories.) 
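The problem is easy to reproduce. The 16 dose and outcome pairs below are illustrative stand-ins for Figure 11.3, chosen because they are consistent with the estimates reported in Section 11.2. They are an assumption of this sketch, not a transcription of the book's data set.

```python
from statistics import fmean

# Illustrative doses (mg/day) and outcomes for 16 individuals.
A = [3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45]
Y = [21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, 11, 190]

def sample_average(a):
    """Treatment-specific sample average, or None when nobody received dose a."""
    values = [y for x, y in zip(A, Y) if x == a]
    return fmean(values) if values else None

empty_levels = [a for a in range(101) if sample_average(a) is None]
print(sample_average(90))                 # nobody received 90 mg/day
print(len(empty_levels), "of 101 dose levels have no individuals")
```

With 16 individuals and 101 possible doses, at least 85 dose-specific sample averages are undefined.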

The above description shows that we cannot always let the data “speak 
for themselves” to obtain a meaningful estimate. Rather, we often need to 
supplement the data with a model, as we describe in the next section.


11.2 Parametric estimators of the conditional mean 


More generally, the restriction on 
the shape of the relation is known 
as the functional form and, by 
some authors, as the dose-response 
curve. We do not use the latter 
term because it suggests that the 
dose of treatment causally affects
the response, which could be false 
in the presence of confounding. 
Under the assumption that the variance
of the residuals does not depend
on A (homoscedasticity), the
Wald 95% confidence intervals are
(−21.2, 70.3) for $\theta_0$, (1.28, 2.99)
for $\theta_1$, and (172.1, 261.6) for
E[Y|A = 90].


We want to estimate the mean of Y among individuals with treatment level 
A = 90, i.e., E[Y|A = 90], from the data in Figure 11.3. Suppose we expect the 
mean of Y among individuals with treatment level A = 90 to lie between the 
mean among individuals with A = 80 and the mean among individuals with 
A = 100. In fact, suppose we knew that the treatment-specific population 
mean of Y is a linear function of the value of treatment A throughout the 
range of A. More precisely, we know that the mean of Y, E[Y|A], increases (or
decreases) from some value $\theta_0$ for A = 0 by $\theta_1$ units per unit of A. Or, more
compactly,


$E[Y|A] = \theta_0 + \theta_1 A$


This equation is a restriction on the shape of conditional mean function E[Y|A].
This particular restriction is referred to as a linear mean model, and the
quantities $\theta_0$ and $\theta_1$ are referred to as the parameters of the model. Models that
describe the conditional mean function in terms of a finite number of parameters
are referred to as parametric conditional mean models. In our example,
the parameters $\theta_0$ and $\theta_1$ define a straight line that crosses (intercepts) the
vertical axis at $\theta_0$ and that has a slope $\theta_1$. That is, the model specifies that
all conditional mean functions are straight lines, though their intercepts and
slopes may vary.

We are now ready to combine the data in Figure 11.3 with our parametric
mean model to estimate E[Y|A = a] for all values a from 0 to 100. The first
step is to obtain estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ of the parameters $\theta_0$ and $\theta_1$. The second
step is to use these estimates to estimate the mean of Y for any value A = a.
For example, to estimate the mean of Y among individuals with treatment
level A = 90, we use the expression $\hat{E}[Y|A = 90] = \hat{\theta}_0 + 90\hat{\theta}_1$. The estimate
$\hat{E}[Y|A]$ for each individual is referred to as the predicted value.

An exactly unbiased estimator of $\theta_0$ and $\theta_1$ can be obtained by the method
of ordinary least squares. A nontechnical motivation of the method follows.
Consider all possible candidate straight lines for Figure 11.3, each of them
with a different combination of values of intercept $\theta_0$ and slope $\theta_1$. For each
candidate line, one can calculate the vertical distance from each dot to the line
(the residual), square each of those 16 residuals, and then sum the 16 squared
residuals. The line for which the sum is the smallest is the “least squares” line,
and the parameter values $\hat{\theta}_0$ and $\hat{\theta}_1$ of this “least squares” line are the “least
squares” estimates. The values $\hat{\theta}_0$ and $\hat{\theta}_1$ can be easily computed using linear
algebra, as described in any statistics textbook.

In our example, the parameter estimates are $\hat{\theta}_0 = 24.55$ and $\hat{\theta}_1 = 2.14$,
which define the straight line shown in Figure 11.4. The predicted mean of
Y among individuals with treatment level A = 90 is therefore $\hat{E}[Y|A = 90] =
24.55 + 90 \times 2.14 = 216.9$ (computed with the unrounded estimates). Because
ordinary least squares estimation uses all data points to find the best line, the
mean of Y in the group A = a, i.e., E[Y|A = a], is estimated by borrowing
information from individuals who have values of treatment A not equal to a.
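The least squares recipe for a single covariate has a closed form, sketched below in plain Python. The 16 dose and outcome pairs are illustrative stand-ins for Figure 11.3; they are an assumption of this sketch, selected because they reproduce the estimates quoted in the text, and the function names are hypothetical.

```python
from statistics import fmean

# Illustrative doses (mg/day) and outcomes for 16 individuals.
A = [3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45]
Y = [21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, 11, 190]

def ols_line(x, y):
    """Closed-form least squares intercept and slope for one covariate."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rss(intercept, slope):
    """Sum of squared vertical distances from each dot to a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(A, Y))

theta0_hat, theta1_hat = ols_line(A, Y)
prediction_90 = theta0_hat + 90 * theta1_hat
print(round(theta0_hat, 2), round(theta1_hat, 2), round(prediction_90, 1))

# Any other candidate line has a larger sum of squared residuals.
print(rss(theta0_hat, theta1_hat) < rss(theta0_hat + 1, theta1_hat))
```

The prediction at A = 90 uses all 16 individuals even though none of them received that dose, which is the borrowing of information described above.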

So what is a model? A model is defined by an a priori restriction on the 
joint distribution of the data. Our linear conditional mean model says that the 
conditional mean function E[Y|A] is a straight line, which restricts its shape.
For example, the model restricts the mean of Y for A = 90 to be between the
mean of Y for A = 80 and the mean of Y for A = 100. This restriction is
encoded by the parameters $\theta_0$ and $\theta_1$. A parametric conditional mean model
is, through its a priori restrictions, adding information to compensate for the 
lack of sufficient information in the data. 

Parametric estimators, those based on a parametric conditional mean model,
allow us to estimate quantities that cannot be estimated otherwise, e.g., the
mean of Y among individuals in the target population with treatment level
A = 90 when no such individuals exist in the study population. But this is not
a free lunch. When using a parametric model, the inferences are correct only
if the restrictions encoded in the model are correct, i.e., if the model is
correctly specified. Thus model-based causal inference, to which a large fraction
of the remainder of this book is devoted, relies on the condition of
(approximately) no model misspecification. Because parametric models are rarely, if
ever, perfectly specified, a certain degree of model misspecification is almost
always expected. This can be at least partially rectified by using nonparametric
estimators, which we describe in the next section.


11.3 Nonparametric estimators of the conditional mean 


Let us return to the data in Figure 11.1. Treatment A is dichotomous and we 
want to consistently estimate the mean of Y in the treated E[Y|A = 1] and in 
the untreated E[Y|A = 0]. Suppose we have become so enamored with models 
that we decide to use one to estimate these two quantities. Again we propose
a linear model

$E[Y|A] = \theta_0 + \theta_1 A$


where $E[Y|A = 0] = \theta_0 + 0 \times \theta_1 = \theta_0$ and $E[Y|A = 1] = \theta_0 + 1 \times \theta_1 = \theta_0 + \theta_1$.
We use the least squares method to obtain estimates of the parameters $\theta_0$ and
$\theta_1$. These estimates are $\hat{\theta}_0 = 67.5$ and $\hat{\theta}_1 = 78.75$. We therefore estimate
$\hat{E}[Y|A = 0] = 67.5$ and $\hat{E}[Y|A = 1] = 146.25$. Note that our model-based
estimates of the mean of Y are identical to the sample averages we calculated
in Section 11.1. This is not a coincidence but an expected finding.

CODE: Program 11.2

Let us take a second look at the model $E[Y|A] = \theta_0 + \theta_1 A$ with a dichotomous
treatment A. If we rewrite the model as $E[Y|A = 1] = E[Y|A = 0] + \theta_1$,
we see that the model simply states that the mean in the treated E[Y|A = 1]
is equal to the mean in the untreated E[Y|A = 0] plus a quantity $\theta_1$, where $\theta_1$
may be negative, positive or zero. But this statement is of course always true!
The model imposes no restrictions whatsoever on the values of E[Y|A = 1]
and E[Y|A = 0]. Therefore $E[Y|A = a] = \theta_0 + \theta_1 a$ with a dichotomous
treatment A is not a model because it lets the data speak for themselves, just like
the sample average does. “Models” which do not impose restrictions on the
distribution of the data are saturated models. Because they formally look like
models even if they do not fit our definition of model, saturated models are
ordinarily referred to as models too.

In this book we define “model” as
an a priori mathematical restriction
on the possible states of nature
(Robins, Greenland 1986). Part I
was entitled “Causal inference without
models” because it only described
saturated models.
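The equivalence between the saturated linear model and the sample averages can be checked numerically. The outcome values below are illustrative, chosen so that the group averages match the 67.5 and 146.25 quoted in the text; they are an assumption of this sketch rather than the book's data set.

```python
from statistics import fmean

# Illustrative data: 8 treated (A = 1) followed by 8 untreated (A = 0).
A = [1] * 8 + [0] * 8
Y = [200, 150, 220, 110, 50, 180, 90, 170,
     170, 30, 70, 110, 80, 50, 10, 20]

def ols_line(x, y):
    """Closed-form least squares intercept and slope for one covariate."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

theta0_hat, theta1_hat = ols_line(A, Y)
mean_untreated = fmean([y for a, y in zip(A, Y) if a == 0])
mean_treated = fmean([y for a, y in zip(A, Y) if a == 1])

# The intercept is the untreated average; intercept plus slope is the treated average.
print(theta0_hat, mean_untreated)
print(theta0_hat + theta1_hat, mean_treated)
```

Two parameters and two conditional means leave the least squares fit no room to smooth, so it simply returns the two group averages.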

Generally, the model is saturated whenever the number of parameters in a
conditional mean model is equal to the number of unknown conditional means
in the population. For example, the linear model $E[Y|A] = \theta_0 + \theta_1 A$ has two
parameters and, when A is dichotomous, there exist two unknown conditional
means: the means E[Y|A = 1] and E[Y|A = 0]. Since the values of the two
parameters are not restricted by the model, neither are the values of the means.
As a contrast, consider the data in Figure 11.3 where A can take values from 0
to 100. The linear model $E[Y|A] = \theta_0 + \theta_1 A$ has two parameters but estimates
101 quantities, i.e., E[Y|A = 0], E[Y|A = 1], ..., E[Y|A = 100]. The only hope
for unbiasedly estimating 101 quantities with these two parameters is to be
fortunate to have all 101 means E[Y|A = a] lie along a straight line. When a
model has only a few parameters but it is used to estimate many population
quantities, we say that the model is parsimonious.

A saturated model has the same
number of unknowns on both sides
of the equal sign.

Fine Point 11.1

Fisher consistency. Our definition of a nonparametric estimator in the main text coincides with what is known in
statistics as a Fisher consistent estimator (Fisher 1922). That is, an estimator of a population quantity that, when
calculated using the entire population rather than a sample, yields the true value of the population parameter. By
definition, a Fisher consistent estimator lacks any model restrictions but, as discussed in the text, a Fisher consistent
estimate may not exist for many population quantities. Technically, Fisher consistent estimators are the nonparametric
maximum likelihood estimators of population quantities under a saturated model.

In the statistical literature, the term nonparametric estimator is sometimes used to refer to estimators that are not
Fisher consistent but that impose very weak restrictions, such as kernel regression models. See Technical Point 11.1 for
details.
Here we define nonparametric estimators of the conditional mean function
as those that produce estimates from the data without any a priori restrictions
on the conditional mean function (see Fine Point 11.1 for a more rigorous
definition). An example of a nonparametric estimator of the population mean
E[Y|A = a] for a dichotomous treatment is its empirical version, the sample
average or, equivalently, the saturated model described in this section. When
A is discrete with 100 levels and no individual in the sample has A = 90, no
nonparametric estimator of E[Y|A = 90] exists. All methods for causal inference
that we described in Part I of this book (standardization, IP weighting,
stratification, matching) were based on nonparametric estimators of population
quantities under a saturated model because they did not impose any a
priori restrictions on the value of the effect estimates. In contrast, most methods
for causal inference described in Part II of this book rely on estimators
that are parametric estimators of some part of the distribution of the data.
Parametric estimation and other approaches to borrow information are our
only hope when, as is often the case, data are unable to speak for themselves.

For causal inference, identifiability
assumptions are the assumptions
that we would have to make even if
we had an infinite amount of data.
Modeling assumptions are the assumptions
that we have to make
precisely because we do not have
an infinite amount of data.


11.4 Smoothing 


Consider again the data in Figure 11.3 and the linear model $E[Y|A] = \theta_0 + \theta_1 A$.
The parameter $\theta_1$ is the difference in mean outcome per unit of treatment dose
A. Because $\theta_1$ is a single number, the model specifies that the difference in
mean outcome Y per unit of treatment A must be constant throughout the
entire range of A, that is, the model requires the conditional mean outcome to
follow a straight line as a function of treatment dose A. Figure 11.4 shows the
best-fitting straight line.

But one can imagine situations in which the difference in mean outcome is
larger for a one-unit change at low doses of treatment, and smaller for a
one-unit change at high doses. This would be the case if, once the treatment dose
reaches a certain level, higher doses have an increasingly small effect. Under this
scenario, the model $E[Y|A] = \theta_0 + \theta_1 A$ is incorrect. However, linear models
can be made more flexible.

Caution: Often the term “linear”
is used with two different meanings.
A model is linear when it is
expressed as a linear combination
of parameters and functions of the
variables, even if the latter are
nonlinear functions (e.g., higher powers
or logarithms) of the covariates.

CODE: Program 11.3

Under the homoscedasticity assumption,
the Wald 95% confidence
interval for E[Y|A = 90] is
(142.8, 251.5).

We used a model for continuous
outcomes as an example. The same
reasoning applies to models for
dichotomous outcomes such as
logistic models (see Technical Point
11.1).

For example, suppose we fit the model $E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$, where
$A^2 = A \times A$ is A-squared, to the data in Figure 11.3. This is still referred
to as a linear model because the conditional mean is expressed as a linear
combination, i.e., as the sum of the products of each covariate (A and $A^2$)
with its associated coefficient (the parameters $\theta_1$ and $\theta_2$) plus an intercept
($\theta_0$). However, whenever $\theta_2$ is not zero, the parameters $\theta_0$, $\theta_1$, and $\theta_2$ now
define a curve (a parabola) rather than a straight line. We refer to $\theta_1$ as the
parameter for the linear term A, and to $\theta_2$ as the parameter for the quadratic
term $A^2$.

The curve under the 3-parameter linear model $E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$
can be found via ordinary least squares estimation applied to the data in
Figure 11.3. The estimated curve is shown in Figure 11.5. The parameter
estimates are $\hat{\theta}_0 = -7.41$, $\hat{\theta}_1 = 4.11$, and $\hat{\theta}_2 = -0.02$. The predicted mean
of Y among individuals with treatment level A = 90 is obtained from the
expression $\hat{E}[Y|A = 90] = \hat{\theta}_0 + 90\hat{\theta}_1 + 90 \times 90\,\hat{\theta}_2 = 197.1$.
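A 3-parameter fit needs the solution of a small linear system. The sketch below solves the normal equations with exact rational arithmetic from the standard library, which sidesteps rounding problems with large powers of A. The data are illustrative stand-ins for Figure 11.3 (an assumption of this sketch, picked because they are consistent with the estimates quoted in the text), and the function name is hypothetical.

```python
from fractions import Fraction

# Illustrative doses (mg/day) and outcomes for 16 individuals.
A = [3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45]
Y = [21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, 11, 190]

def least_squares_polynomial(x, y, degree):
    """Solve the normal equations (X'X) theta = X'y exactly, for columns 1, x, x^2, ..."""
    p = degree + 1
    rows = [[Fraction(sum(xi ** (r + c) for xi in x)) for c in range(p)]
            + [Fraction(sum(xi ** r * yi for xi, yi in zip(x, y)))]
            for r in range(p)]
    for col in range(p):                      # forward elimination
        for r in range(col + 1, p):
            factor = rows[r][col] / rows[col][col]
            rows[r] = [v - factor * w for v, w in zip(rows[r], rows[col])]
    theta = [Fraction(0)] * p
    for r in reversed(range(p)):              # back substitution
        known = sum(rows[r][c] * theta[c] for c in range(r + 1, p))
        theta[r] = (rows[r][p] - known) / rows[r][r]
    return [float(t) for t in theta]

theta0_hat, theta1_hat, theta2_hat = least_squares_polynomial(A, Y, degree=2)
prediction_90 = theta0_hat + 90 * theta1_hat + 90 * 90 * theta2_hat
print(round(theta0_hat, 2), round(theta1_hat, 2), round(theta2_hat, 4), round(prediction_90, 1))
```

The prediction should be computed from the unrounded coefficients: plugging in the rounded value −0.02 for the quadratic coefficient changes the result by several units because it is multiplied by 8100.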

We could keep adding parameters for a cubic term ($\theta_3 A^3$), a quartic term
($\theta_4 A^4$)... until we reach a 15th-degree term ($\theta_{15} A^{15}$). At that point the number
of parameters in our model equals the number of data points (individuals). The
shape of the curve would change as the number of parameters increases. In
general, the more parameters in the model, the more inflection points will
appear.

That is, the curve generally becomes more “wiggly,” or less smooth, as the
number of parameters increases. A linear model with 2 parameters (a straight
line) is the smoothest model. A linear model with as many parameters as data
points is the least smooth model because it has as many possible inflection
points as data points. In fact, such a model interpolates the data, i.e., each data
point in the sample lies on the estimated conditional mean function.
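The interpolation property can be verified without fitting anything. When all doses are distinct, the 16-parameter least squares polynomial is the unique degree-15 polynomial through the 16 points, so the sketch below evaluates it with Lagrange's interpolation formula instead of least squares (a swapped-in but equivalent technique in this setting). The data are illustrative stand-ins for Figure 11.3 and the function name is hypothetical.

```python
from fractions import Fraction

# Illustrative doses (mg/day) and outcomes for 16 individuals; all doses are distinct.
A = [3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45]
Y = [21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, 11, 190]

def interpolating_polynomial(x, y):
    """Return a function that evaluates the polynomial passing through every (x, y) point."""
    xs = [Fraction(v) for v in x]
    ys = [Fraction(v) for v in y]

    def evaluate(a):
        a = Fraction(a)
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = Fraction(1)              # Lagrange basis polynomial at a
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (a - xj) / (xi - xj)
            total += yi * weight
        return total

    return evaluate

curve = interpolating_polynomial(A, Y)
print(all(curve(a) == y for a, y in zip(A, Y)))   # every residual is exactly zero
print(float(curve(90)))                           # prediction at an unobserved dose
```

Every observed point lies on the curve, so nothing is learned about noise, and the value at an unobserved dose such as 90 is driven entirely by the wiggles between neighboring points.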

Often modeling can be viewed as a procedure to transform noisy data into 
more or less smooth curves. This smoothing occurs because the model borrows 
information from many data points to predict the outcome value at a particular 
combination of values of the covariates. The smoothing results from E[Y|A = 
a] being estimated by borrowing information from individuals with A not equal 
to a. All parametric estimators incorporate some degree of smoothing. 

The degree of smoothing depends on how much information is borrowed 
across individuals. The 2-parameter model $E[Y|A] = \theta_0 + \theta_1 A$ estimates
E[Y|A = 90] by borrowing information from all individuals in the study popu- 
lation to find the least squares straight line. A model with as many parameters 
as individuals does not borrow any information to estimate E[Y |A] at the values 
of A that occur in the data, though it borrows information (by interpolation) 
for values of A that do not occur in the data. 

Intermediate degrees of smoothing can be achieved by using an intermediate 
number of parameters or, more generally, by restricting the number of individ- 
uals that contribute to the estimation. For example, to estimate E[Y |A = 90] 
we could decide to fit a 2-parameter model $E[Y|A] = \theta_0 + \theta_1 A$ restricted to
individuals with treatment doses between 80 and 100. That is, we would only 
borrow information from individuals in a 10-unit window of A = 90. The wider 
the window around A = 90, the more smoothing would be achieved. 
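The effect of the window width can be sketched as follows. The code fits a straight line using only individuals whose dose lies within a chosen distance of 90 and reports the resulting estimate together with the number of individuals used. The data are illustrative stand-ins for Figure 11.3 (an assumption of this sketch), the function name is hypothetical, and a window must contain at least two distinct doses for the line to be defined.

```python
from statistics import fmean

# Illustrative doses (mg/day) and outcomes for 16 individuals.
A = [3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45]
Y = [21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, 11, 190]

def local_linear_estimate(a, half_width):
    """Fit a straight line using only individuals within half_width of dose a."""
    window = [(x, y) for x, y in zip(A, Y) if abs(x - a) <= half_width]
    xs = [x for x, _ in window]
    ys = [y for _, y in window]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)   # zero if fewer than two distinct doses
    slope = sum((x - x_bar) * (y - y_bar) for x, y in window) / sxx
    return y_bar + slope * (a - x_bar), len(window)

for half_width in (10, 30, 100):
    estimate, n_used = local_linear_estimate(90, half_width)
    print(half_width, n_used, round(estimate, 1))
```

With doses between 80 and 100 only two individuals contribute, so the estimate is the line through those two points. Widening the window to the whole range recovers the 2-parameter fit based on all 16 individuals.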

In our simplistic examples above, all models included a single covariate
(with either a single parameter for A or two parameters for A and $A^2$) so that
the curves can be represented on a two-dimensional book page. In realistic
applications, models often include many different covariates so that the curves
are really hyperdimensional surfaces. Regardless of the dimensionality of the
problem, the concept of smoothing remains invariant: the fewer parameters in
the model, the smoother the prediction (response) surface will be.

Fine Point 11.2

Model dimensionality and the relation between frequentist and Bayesian intervals. In frequentist statistical
inference, probability is defined as frequency. In Bayesian inference, probability is defined as degree-of-belief, a concept
very different from probability as frequency. Chapter 10 described the confidence intervals used in frequentist statistical
inference. Bayesian statistical inference uses credible intervals, which have a more natural interpretation: A Bayesian
95% credible interval means that, given the observed data, “there is a 95% probability that the estimand is in the
interval”. However, in part because of the requirement to specify the investigators’ degree of belief, Bayesian inference
is less commonly used than frequentist inference.

Interestingly, in simple, low-dimensional parametric models with large sample sizes, 95% Bayesian credible intervals
are also 95% frequentist confidence intervals, whereas in high-dimensional or nonparametric models, a Bayesian 95%
credible interval may not be a 95% confidence interval as it may trap the estimand much less than 95% of the time.
The underlying reason for these results is that Bayesian inference requires the specification of a prior distribution for
all unknown parameters. In low-dimensional parametric models the information in the data swamps that contained in
reasonable priors. As a result, inference is relatively insensitive to the particular prior distribution selected. However,
this is no longer the case in high-dimensional models. Therefore if the true parameter values that generated the data
are unlikely under the chosen prior distribution, the center of the Bayesian credible interval will be pulled away from the
true parameters and towards the parameter values given the greatest probability under the prior.


11.5 The bias-variance trade-off 


In previous sections we have used the 16 individuals in Figure 11.3 to estimate 
the mean outcome Y among people receiving a treatment dose of A = 90 in 
the target population, E[Y|A = 90]. Since nobody in the study population 
received A = 90, we could not let the data speak for themselves. So we 
combined the data with a linear model. The estimate $\hat{E}[Y|A = 90]$ varied with 
the model. Under the 2-parameter model $E[Y|A] = \theta_0 + \theta_1 A$, the estimate 
was 216.9 (95% confidence interval: 172.1, 261.6). Under the 3-parameter 
model $E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$, the estimate was 197.1 (95% confidence 
interval: 142.8, 251.5). We used two different parametric models that yielded 
two different estimates. Which one is better? Is 216.9 or 197.1 closer to the 
mean in the target population? 

If the relation is truly curvilinear, then the estimate from the 2-parameter 
model will be biased because this model assumes a straight line. On the other 
hand, if the relation is truly a straight line, then the estimates from both models 
will be valid. This is so because the 3-parameter model 
$E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$ is correctly specified whether the 
relation follows a straight line (in which case $\theta_2 = 0$) or a parabolic 
curve (in which case $\theta_2 \neq 0$). One safe strategy would be to use the 
3-parameter model $E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$ rather than 
the 2-parameter model $E[Y|A] = \theta_0 + \theta_1 A$. Because the 3-parameter model is 
correctly specified under both a straight line and a parabolic curve, it is less 
likely to be biased. In general, the larger the number of parameters in the 
model, the fewer restrictions the model imposes; the less smooth the model, 
the more protection afforded against bias from model misspecification. 

Fine Point 11.2 discusses the implications of model dimensionality for 
frequentist and Bayesian intervals. 

Although less smooth models may yield a less biased estimate, they also 
result in a larger variance, i.e., wider 95% confidence intervals around the 
estimate. For example, the estimated 95% confidence interval around E[Y|A = 
90] was much wider when we used the 3-parameter model than when we used 
the 2-parameter model. However, when the estimate E[Y|A = 90] based on the 
2-parameter model is biased, the standard (nominal) 95% confidence interval 
is not calibrated, that is, it does not cover the true parameter E[Y|A = 90] 
95% of the time. 
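
The trade-off described above can be illustrated with a small simulation. The sketch below is not based on the data in Figure 11.3: the 16 doses, the curvilinear mean function `true_mean`, and the noise level are all hypothetical choices made only for illustration. It repeatedly generates data, fits the 2-parameter and 3-parameter models, and compares the bias and spread of the two estimates of E[Y|A = 90].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 16 fixed treatment doses and a truly curvilinear mean.
doses = np.linspace(3, 200, 16)

def true_mean(a):
    return 100 + 2.0 * a - 0.005 * a**2

target = true_mean(90.0)   # the estimand E[Y|A = 90] in this simulation
n_sims, noise_sd = 2000, 50.0

est_linear = np.empty(n_sims)
est_quadratic = np.empty(n_sims)
for s in range(n_sims):
    y = true_mean(doses) + rng.normal(0.0, noise_sd, size=doses.size)
    # np.polyfit returns coefficients from highest to lowest degree;
    # np.polyval evaluates the fitted polynomial at A = 90.
    est_linear[s] = np.polyval(np.polyfit(doses, y, deg=1), 90.0)
    est_quadratic[s] = np.polyval(np.polyfit(doses, y, deg=2), 90.0)

bias_linear = est_linear.mean() - target
bias_quadratic = est_quadratic.mean() - target
sd_linear = est_linear.std(ddof=1)
sd_quadratic = est_quadratic.std(ddof=1)

print(f"2-parameter model: bias {bias_linear:7.2f}, SD {sd_linear:6.2f}")
print(f"3-parameter model: bias {bias_quadratic:7.2f}, SD {sd_quadratic:6.2f}")
```

Under these assumed settings the 2-parameter model should show a clear bias with a smaller spread, and the 3-parameter model little bias with a larger spread. The size of each effect depends entirely on the assumed curvature and noise.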

This bias-variance trade-off is at the heart of many data analyses. Investi- 
gators using models need to decide whether some protection against bias—by, 
say, adding more parameters to the model—is worth the cost in terms of vari- 
ance. Though some formal procedures exist to aid these decisions, in practice 
many investigators decide which model to use based on criteria like tradition, 
interpretability of the parameters, and software availability. In this book we 
will usually assume that our parametric models are correctly specified. This 
is an unrealistic assumption, but it allows us to focus on the problems that 
are specific to causal analyses. Model misspecification is, after all, a problem 
that can arise in any sort of data analysis, regardless of whether the estimates 
are endowed with a causal interpretation. In practice, careful investigators will 
always question the validity of their models, and will conduct an analysis to 
assess the sensitivity of their estimates to model specification. 

We are now ready to describe the use of models for causal inference. 

Technical Point 11.1 


A taxonomy of commonly used models. The main text describes linear conditional mean models of the form 
$E[Y|X] = \theta X \equiv \sum_{i=0}^{p} \theta_i X_i$, where X is a vector of covariates $X_0, X_1, \ldots, X_p$ with $X_0 = 1$ for all n individuals. These 
models are a subset of a larger class of conditional mean models which have two components: a linear functional form or 
predictor $\sum_{i=0}^{p} \theta_i X_i$ and a link function $g\{\cdot\}$ such that $g\{E[Y|X]\} = \sum_{i=0}^{p} \theta_i X_i$. 

The linear conditional mean models described in the main text use the identity link function. Conditional mean 
models for outcomes with strictly positive values (e.g., counts, the numerator of incidence rates) often use the log 
link function to ensure that all predicted values will be greater than zero, i.e., $\log\{E[Y|X]\} = \sum_{i=0}^{p} \theta_i X_i$, so 
$E[Y|X] = \exp\left(\sum_{i=0}^{p} \theta_i X_i\right)$. Conditional mean models for dichotomous outcomes (i.e., those that only take values 0 and 1) often 
use a logit link, i.e., $\log\left\{\frac{E[Y|X]}{1 - E[Y|X]}\right\} = \sum_{i=0}^{p} \theta_i X_i$, so that $E[Y|X] = \mathrm{expit}\left(\sum_{i=0}^{p} \theta_i X_i\right)$. This link ensures that all predicted 
values will be greater than 0 and less than 1. Conditional mean models that use the logit function are referred to as 
logistic regression models, and they are widely used in this book. For these links (referred to as canonical links) we can 
estimate $\theta$ by maximum likelihood under a normal model for the identity link, a Poisson model for the log link, and a 
logistic regression model for the logit link. These estimates are consistent for $\theta$ as long as the conditional mean model 
for E[Y|X] is correct. Generalized estimating equation (GEE) models, often used to deal with repeated measures, are 
a further example of a conditional mean model (Liang and Zeger, 1986). 
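
As a brief illustration of the three link functions above, the following sketch maps one linear predictor to a mean under each link. The coefficient vector `theta` and the covariate matrix `X` are hypothetical values chosen only to show the range restrictions. No model is fit here.

```python
import numpy as np

# Hypothetical coefficients and covariates, for illustration only.
theta = np.array([-1.0, 0.8, -0.3])           # theta_0, theta_1, theta_2
X = np.array([[1.0, 0.0, 2.0],                # first column is X_0 = 1
              [1.0, 1.5, 0.5],
              [1.0, 4.0, 1.0]])

linear_predictor = X @ theta                  # sum_i theta_i * X_i

def expit(z):
    """Inverse of the logit function: exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

mean_identity = linear_predictor              # identity link: any real value
mean_log = np.exp(linear_predictor)           # log link: always > 0
mean_logit = expit(linear_predictor)          # logit link: always in (0, 1)

print(mean_identity, mean_log, mean_logit, sep="\n")
```

The same linear predictor can be negative under the identity link, yet its image is strictly positive under the log link and strictly between 0 and 1 under the logit link.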

Conditional mean models only specify a parametric form for E[Y |X] but do not otherwise restrict the distribution of 
Y|X or the marginal distribution of X. Therefore, when X or Y are continuous, a parametric conditional mean model 
is a semiparametric model for the joint distribution of the data (X,Y) because parts of the distribution are modeled 
parametrically whereas others are left unrestricted. The model is semiparametric because the set of all unrestricted 
components of the joint distribution cannot be represented by a finite number of parameters. 

Conditional mean models themselves can be generalized by relaxing the assumption that E[Y|X] takes a parametric 
form. For example, a kernel regression model does not impose a specific functional form on E[Y|X] but rather estimates 
E[Y|X = x] for any x by $\sum_{i=1}^{n} w_h(x - X_i) Y_i / \sum_{i=1}^{n} w_h(x - X_i)$, where $w_h(z)$ is a positive function, known as a kernel 
function, that attains its maximum value at z = 0 and decreases to 0 as |z| gets large at a rate that depends on the 
parameter h subscripting w. As another example, generalized additive models (GAMs) replace the linear combination 
$\sum_{i=0}^{p} \theta_i X_i$ of a conditional mean model by a sum of smooth functions $\sum_{i=0}^{p} f_i(X_i)$. The model can be estimated using a 
backfitting algorithm with $f_i(\cdot)$ estimated at iteration k by, for example, kernel regression (Hastie and Tibshirani 1990). 
In the text we discuss smoothing with parametric models which specify an a priori functional form for E[Y|X = x], 
such as a parabola. In estimating E[Y|X = x], the model may borrow information from values of X that are far from 
x. In contrast, kernel regression models do not specify an a priori functional form and borrow information only from 
values of X near to x when estimating E[Y|X = x]. A kernel regression model is an example of a “nonparametric” 
regression model. This use of the term “nonparametric” differs from our previous usage. Our nonparametric estimators 
of E[Y|X = x] only used those individuals for whom X equalled x exactly; no information was borrowed even from 
close neighbors. Here “nonparametric” estimators of E[Y|X = x] use individuals with values of X near to x. How near 
is controlled by a smoothing parameter referred to as the bandwidth h. 
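
The kernel regression estimator above can be sketched in a few lines. The Gaussian form of `gaussian_kernel` is one common choice of kernel function and is an assumption of this sketch, as are the toy data. The point is only to show how the bandwidth h controls how much information is borrowed from neighbors.

```python
import numpy as np

def gaussian_kernel(z, h):
    """A positive function that peaks at z = 0 and decays as |z| grows;
    the bandwidth h controls how fast it decays."""
    return np.exp(-0.5 * (z / h) ** 2)

def kernel_regression(x, X, Y, h):
    """Estimate E[Y|X = x] as sum_i w_h(x - X_i) Y_i / sum_i w_h(x - X_i)."""
    weights = gaussian_kernel(x - X, h)
    return np.sum(weights * Y) / np.sum(weights)

# Hypothetical data, for illustration only.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

narrow = kernel_regression(3.0, X, Y, h=0.1)   # borrows almost nothing from neighbors
wide = kernel_regression(3.0, X, Y, h=100.0)   # close to the overall mean of Y
print(narrow, wide, Y.mean())
```

With a very small bandwidth the estimate at x = 3 is essentially the outcome of the individual with X = 3. With a very large bandwidth it approaches the overall sample mean.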
Chapter 12 
IP WEIGHTING AND MARGINAL STRUCTURAL MODELS 


Part II is organized around the causal question “what is the average causal effect of smoking cessation on body 
weight gain?” In this chapter we describe how to use IP weighting to estimate this effect from observational data. 
Though IP weighting was introduced in Chapter 2, we only described it as a nonparametric method. We now 
describe the use of models together with IP weighting which, under additional assumptions, will allow us to tackle 
high-dimensional problems with many covariates and nondichotomous treatments. 

To estimate the effect of smoking cessation on weight gain we will use real data from the NHEFS, an acronym 
that stands for (ready for a long name?) National Health and Nutrition Examination Survey Data I Epidemi- 
ologic Follow-up Study. The NHEFS was jointly initiated by the National Center for Health Statistics and the 
National Institute on Aging in collaboration with other agencies of the United States Public Health Service. A 
detailed description of the NHEFS, together with publicly available data sets and documentation, can be found at 
wwwn.cdc.gov/nchs/nhanes/nhefs/. For this and future chapters, we will use a subset of the NHEFS data that 
is available from this book’s web site. We encourage readers to improve upon and refine our analyses. 


12.1 The causal question 


Our goal is to estimate the average causal effect of smoking cessation (the 
treatment) A on weight gain (the outcome) Y. To do so, we will use data 
from 1566 cigarette smokers aged 25-74 years who, as part of the NHEFS, had 
a baseline visit and a follow-up visit about 10 years later. Individuals were 
classified as treated A = 1 if they reported having quit smoking before the 
follow-up visit, and as untreated A = 0 otherwise. Each individual’s weight 
gain Y was measured (in kg) as the body weight at the follow-up visit minus 
the body weight at the baseline visit. Most people gained weight, but quitters 
gained more weight on average. The average weight gain was $\hat{E}[Y|A = 1] = 4.5$ 
kg in the quitters, and $\hat{E}[Y|A = 0] = 2.0$ kg in the non-quitters. The difference 
E[Y|A = 1] - E[Y|A = 0] was therefore estimated to be 2.5, with a 95% 
confidence interval from 1.7 to 3.4. A conventional statistical test of the null 
hypothesis that this difference was equal to zero yielded a P-value < 0.001. 

We restricted the analysis to individuals with known sex, age, race, 
weight, height, education, alcohol use and intensity of smoking at 
the baseline (1971-75) and follow-up (1982) visits, and who answered 
the medical history questionnaire at baseline. See Fine Point 12.1. 

Table 12.1 
Mean baseline characteristics    A = 1    A = 0 
Age, years                       46.2     42.8 
Men, %                           54.6     46.6 
White, %                         91.1     85.4 
University, %                    15.4      9.9 
Weight, kg                       72.4     70.3 
Cigarettes/day                   18.6     21.2 
Years smoking                    26.0     24.1 
Little exercise, %               40.7     37.9 
Inactive life, %                 11.2      8.9 

CODE: Program 12.1 computes the descriptive statistics shown in this 
section. 

We define $E[Y^{a=1}]$ as the mean weight gain that would have been observed 
if all individuals in the population had quit smoking before the follow-up visit, 
and $E[Y^{a=0}]$ as the mean weight gain that would have been observed if all 
individuals in the population had not quit smoking. We define the average 
causal effect on the additive scale as $E[Y^{a=1}] - E[Y^{a=0}]$, that is, the difference 
in mean weight gain that would have been observed if everybody had been treated 
compared with untreated. This is the causal effect that we will be primarily 
concerned with in this and the next chapters. 

The associational difference E[Y|A = 1] - E[Y|A = 0], which we estimated 
in the first paragraph of this section, is generally different from the causal 
difference $E[Y^{a=1}] - E[Y^{a=0}]$. The former will not generally have a causal 
interpretation if quitters and non-quitters differ with respect to characteristics 
that affect weight gain. For example, quitters were on average 4 years older 
than non-quitters (quitters were 44% more likely to be above age 50 than 
non-quitters), and older people gained less weight than younger people, regardless 
of whether they did or did not quit smoking. We say that age is a (surrogate) 
confounder of the effect of A on Y and our analysis needs to adjust for age. The 
unadjusted estimate 2.5 might underestimate the true causal effect 
$E[Y^{a=1}] - E[Y^{a=0}]$. 

Fine Point 7.3 defined surrogate confounders. 

Fine Point 12.1 


Setting a bad example. Our smoking cessation example is convenient: it does not require deep subject-matter 
knowledge and the data are publicly available. One price we have to pay for this convenience is potential selection bias. 

We classified individuals as treated A = 1 if they reported (i) being smokers at baseline in 1971-75, and (ii) having 
quit smoking in the 1982 survey. Condition (ii) implies that the individuals included in our study did not die and were 
not otherwise lost to follow-up between baseline and 1982 (otherwise they would not have been able to respond to the 
survey). That is, we selected individuals into our study conditional on an event (responding to the 1982 survey) that 
occurred after the start of the treatment (smoking cessation). If treatment affects the probability of selection into the 
study, we might have selection bias as described in Chapter 8. (Because different individuals quit smoking at different 
times, A is actually a time-varying treatment, which we will ignore throughout Part II. Time-varying treatments are 
discussed in Part III.) 

A randomized experiment of smoking cessation would not have this problem. Each individual would be assigned to 
either smoking cessation or no smoking cessation at baseline, so that their treatment group would be known even if the 
individual did not make it to the 1982 visit. In Section 12.6 we describe how to deal with potential selection bias due 
to censoring or missing data for the outcome (something that may occur in both observational studies and randomized 
experiments), but the situation described in this Fine Point is different: the missing data concerns the treatment itself. 
This selection bias can be handled through sensitivity analysis, as was done by Hernán et al. (2008, Appendix 3). 

The choice of this example allows us to describe, in our own analysis, a ubiquitous problem in published analyses 
of observational data: a misalignment of treatment assignment and eligibility at the start of follow-up (Hernán et al. 
2016). Though we decided to ignore this issue in order to keep our analysis simple, didactic convenience would not be 
a good excuse to avoid dealing with this bias in real life. 


As shown in Table 12.1, quitters and non-quitters also differed in their 
distribution of other variables such as sex, race, education, baseline weight, 
and intensity of smoking. If these variables are confounders, then they also 
need to be adjusted for in the analysis. In Chapter 18 we discuss strategies 
for confounder selection. Here we assume that the following 9 variables, all 
measured at baseline, are sufficient to adjust for confounding: sex (0: male, 
1: female), age (in years), race (0: white, 1: other), education (5 categories), 
intensity and duration of smoking (number of cigarettes per day and years 
of smoking), physical activity in daily life (3 categories), recreational exercise 
(3 categories), and weight (in kg). That is, L represents a vector of 9 measured 
covariates. In the next section we use IP weighting to adjust for these 
covariates. 
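
Before any adjustment, the unadjusted associational difference reported at the start of this section is simply a difference in sample means. The sketch below shows one way such a difference and a large-sample 95% confidence interval could be computed. The function `mean_difference`, the unequal-variance standard error, and the simulated outcomes are assumptions of this sketch. Only the group sizes (403 quitters out of 1566 individuals) are taken from the text, and the data are not the NHEFS.

```python
import numpy as np

def mean_difference(y, a):
    """Difference in sample means of y between a == 1 and a == 0,
    with a large-sample (Wald) 95% confidence interval."""
    y1, y0 = y[a == 1], y[a == 0]
    diff = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# Simulated stand-in for the study data (NOT the NHEFS): 403 quitters and
# 1163 non-quitters, with hypothetical means and spread.
rng = np.random.default_rng(1)
a = np.repeat([1, 0], [403, 1163])
y = np.where(a == 1, 4.5, 2.0) + rng.normal(0.0, 8.0, size=a.size)

diff, (lower, upper) = mean_difference(y, a)
print(f"difference {diff:.2f} kg, 95% CI ({lower:.2f}, {upper:.2f})")
```

This quantity has no causal interpretation unless quitters and non-quitters are exchangeable, which is why the following sections adjust for L.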


12.2 Estimating IP weights via modeling 


IP weighting creates a pseudo-population in which the arrow from the covari- 
ates L to the treatment A is removed. More precisely, the pseudo-population 
has the following two properties: A and L are statistically independent and 
the mean $E_{ps}[Y|A = a]$ in the pseudo-population equals the standardized mean 
$\sum_l E[Y|A = a, L = l] \Pr[L = l]$ in the actual population. These properties are 
true even if conditional exchangeability $Y^a \perp\!\!\!\perp A | L$ does not hold in the 
actual population (see Technical Point 2.3). Now, if conditional exchangeability 
$Y^a \perp\!\!\!\perp A | L$ holds in the actual population, then these properties imply that (i) 
the mean of $Y^a$ is the same in both populations, (ii) unconditional exchange- 
ability (i.e., no confounding) holds in the pseudo-population, (iii) the counter- 
factual mean $E[Y^a]$ in the actual population is equal to $E_{ps}[Y|A = a]$ in the 
pseudo-population, and (iv) association is causation in the pseudo-population. 
Please reread Chapter 2 if you need a refresher on IP weighting. 

The conditional probability of treatment Pr[A = 1|L] is known as 
the propensity score. More about propensity scores in Chapter 15. 

The curse of dimensionality was introduced in Chapter 10. 

CODE: Program 12.2 
The estimated IP weights $W^A$ ranged from 1.05 to 16.7, and their 
mean was 2.00. 

$E[Y|A] = \theta_0 + \theta_1 A$ is a saturated model because it has 2 parameters, 
$\theta_0$ and $\theta_1$, to estimate two quantities, E[Y|A = 1] and E[Y|A = 0]. 
In this model, $\theta_1 = E[Y|A = 1] - E[Y|A = 0]$. 

Informally, the pseudo-population is created by weighting each individual 
by the inverse (reciprocal) of the conditional probability of receiving the treat- 
ment level that she indeed received. The individual-specific IP weights for 
treatment A are defined as $W^A = 1/f(A|L)$. For our dichotomous treat- 
ment A, the denominator f(A|L) of the IP weight is the probability of quit- 
ting conditional on the measured confounders, Pr[A = 1|L], for the quitters, 
and the probability of not quitting conditional on the measured confounders, 
Pr[A = 0|L], for the non-quitters. We only need to estimate Pr[A = 1|L] be- 
cause Pr[A = 0|L] = 1 - Pr[A = 1|L]. 

In Section 2.4 we estimated the quantity Pr [A = 1|L] nonparametrically: 
we simply counted how many people were treated (A = 1) in each stratum of 
L, and then divided this count by the number of individuals in the stratum. 
All the information required for this calculation was taken from a causally 
interpreted structured tree with 4 branches (2 for L times 2 for A). But non- 
parametric estimation of Pr [A = 1|L] is out of the question when, as in our 
example, we have high-dimensional data with many confounders, some of them 
with many levels. Even if we were willing to recode all 9 confounders except 
age to a maximum of 6 categories each, our tree would still have over 2 mil- 
lion branches. And many more millions if we use the actual range of values 
of duration and intensity of smoking, and weight. We cannot obtain meaning- 
ful nonparametric stratum-specific estimates when there are 1566 individuals 
distributed across millions of strata. We need to resort to modeling. 

To obtain parametric estimates of Pr [A = 1|L] in each of the millions of 
strata defined by L, we fit a logistic regression model for the probability of 
quitting smoking with all 9 confounders included as covariates. We used linear 
and quadratic terms for the (quasi-)continuous covariates age, weight, inten- 
sity and duration of smoking, and we included no product terms between the 
covariates. That is, our model restricts the possible values of Pr [A = 1| L] such 
that, on the logit scale, the conditional relation between the continuous covari- 
ates and the risk of quitting can be represented by a parabolic curve, and each 
covariate’s contribution to the (logit of the) risk is independent of that of the 
other covariates. Under these parametric restrictions, we were able to obtain 
an estimate $\widehat{\Pr}[A = 1|L]$ for each combination of L values, and therefore for 
each of the 1566 individuals in the study population. 
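
The modeling step just described can be sketched as follows. This is not Program 12.2: the data are simulated, only two hypothetical confounders (`sex` and `age`) are used instead of nine, and `fit_logistic` is a small helper written for this illustration that fits the logistic model by Newton-Raphson. The sketch shows a logistic model with linear and quadratic terms for a continuous covariate and no product terms, followed by the computation of the nonstabilized weights.

```python
import numpy as np

def fit_logistic(X, a, n_iter=25):
    """Maximum likelihood logistic regression via Newton-Raphson.
    X must already contain a column of ones for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (a - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated stand-in data (NOT the NHEFS): one dichotomous and one
# continuous confounder; treatment depends on both.
rng = np.random.default_rng(2)
n = 1566
sex = rng.integers(0, 2, size=n)
age = rng.uniform(25, 74, size=n)
true_logit = -2.5 + 0.3 * sex + 0.03 * age
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

# Linear and quadratic terms for the continuous covariate, no product terms.
age_c = (age - age.mean()) / 10.0          # centered and rescaled for numerical stability
X = np.column_stack([np.ones(n), sex, age_c, age_c**2])

beta_hat = fit_logistic(X, a)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))   # estimated Pr[A = 1 | L]

# Nonstabilized IP weights W^A = 1 / f(A|L).
w = np.where(a == 1, 1.0 / p_hat, 1.0 / (1.0 - p_hat))
print(f"weights range {w.min():.2f} to {w.max():.2f}, mean {w.mean():.2f}")
```

In a real analysis one would normally rely on a standard statistical package for the logistic fit. The hand-written fit is used here only to keep the sketch self-contained.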

The next step is computing the difference $E_{ps}[Y|A = 1] - E_{ps}[Y|A = 0]$ 
in the pseudo-population created by the estimated IP weights. If there is 
no confounding for the effect of A in the pseudo-population and the model 
for Pr[A = 1|L] is correct, association is causation and an unbiased estimator 
of the associational difference $E_{ps}[Y|A = 1] - E_{ps}[Y|A = 0]$ in the pseudo- 
population is also an unbiased estimator of the causal difference 
$E[Y^{a=1}] - E[Y^{a=0}]$ in the actual population. 

Our approach to estimate $E_{ps}[Y|A = 1] - E_{ps}[Y|A = 0]$ in the pseudo- 
population was to fit the (saturated) linear mean model $E[Y|A] = \theta_0 + \theta_1 A$ 
by weighted least squares, with individuals weighted by their estimated IP 
weights $\hat{W}$: $1/\widehat{\Pr}[A = 1|L]$ for the quitters, and $1/(1 - \widehat{\Pr}[A = 1|L])$ for the 
non-quitters. The parameter estimate $\hat{\theta}_1$ was 3.4. That is, we estimated that 
quitting smoking increases weight by $\hat{\theta}_1 = 3.4$ kg on average. See Technical 
Point 12.1 for a formal definition of the estimator. 

The weighted least squares estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ with weight W 
of $\theta_0$ and $\theta_1$ are the minimizers of 
$\sum_i W_i [Y_i - (\theta_0 + \theta_1 A_i)]^2$. If $W_i = 1$ for all individuals i, 
we obtain the ordinary least squares estimates described in the previous 
chapter. 

The estimate $\hat{E}[Y|A = a] = \hat{\theta}_0 + \hat{\theta}_1 a$ is equal to 
$\sum_i W_i Y_i / \sum_i W_i$, where the sums are over all subjects with A = a. 

Technical Point 12.1 


Horvitz-Thompson estimators. In Technical Point 3.1, we defined the “apparent” IP weighted mean for treatment 
level a, $E\left[\frac{I(A=a)Y}{f(A|L)}\right]$, which is equal to the counterfactual mean $E[Y^a]$ under positivity and exchangeability. This IP 
weighted mean is consistently estimated by the original Horvitz-Thompson (1952) estimator $\hat{E}\left[\frac{I(A=a)Y}{f(A|L)}\right]$. In this 
chapter, however, we estimated $E[Y^a]$ via the IP weighted least squares estimate $\hat{\theta}_0 + \hat{\theta}_1 a$, a modified Horvitz-Thompson 
estimator often referred to as Hajek estimator, $\hat{E}\left[\frac{I(A=a)Y}{f(A|L)}\right] \Big/ \hat{E}\left[\frac{I(A=a)}{f(A|L)}\right]$ (Hajek 1971, Robins 1998). 

The Hajek estimator is an unbiased estimator of $E\left[\frac{I(A=a)Y}{f(A|L)}\right] \Big/ E\left[\frac{I(A=a)}{f(A|L)}\right]$ which, under positivity, is equal to $E\left[\frac{I(A=a)Y}{f(A|L)}\right]$ 
because $E\left[\frac{I(A=a)}{f(A|L)}\right] = 1$. In practice, the Hajek estimator is preferred because, unlike the Horvitz-Thompson 
estimator, it is guaranteed to lie between 0 and 1 for dichotomous Y. 

On the other hand, if positivity does not hold, then the ratio equals 
$\sum_l E[Y|A = a, L = l, L \in Q(a)] \Pr[L = l | L \in Q(a)]$ and, if exchangeability holds, it equals $E[Y^a | L \in Q(a)]$, where 
$Q(a) = \{l; \Pr(A = a|L = l) > 0\}$ is the set of values l for which A = a may be observed with positive probability. 
Therefore, as discussed in Technical Point 3.1, the difference between Hajek estimators with a = 1 versus a = 0 does 
not have a causal interpretation in the absence of positivity. 
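
The equivalence between the weighted least squares fit of the saturated model and a difference of weighted (Hajek) means can be checked numerically. In the sketch below the outcomes, treatments and weights are hypothetical numbers, and the helper names `weighted_least_squares` and `hajek_mean` are introduced only for this illustration.

```python
import numpy as np

def weighted_least_squares(y, a, w):
    """Minimize sum_i w_i * (y_i - (theta_0 + theta_1 * a_i))**2."""
    X = np.column_stack([np.ones_like(y), a])
    XtW = X.T * w                       # same as X.T @ diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

def hajek_mean(y, a, w, level):
    """Weighted mean of y among individuals with a == level."""
    keep = a == level
    return np.sum(w[keep] * y[keep]) / np.sum(w[keep])

# Hypothetical outcomes, treatments and IP weights, for illustration only.
y = np.array([3.0, 5.0, 4.0, 1.0, 2.0, 0.0])
a = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
w = np.array([1.5, 3.0, 1.2, 2.0, 1.1, 4.0])

theta_0, theta_1 = weighted_least_squares(y, a, w)
difference = hajek_mean(y, a, w, 1) - hajek_mean(y, a, w, 0)
print(theta_1, difference)              # the two numbers agree
```

Because the model is saturated, the fitted intercept equals the weighted mean among the untreated and the fitted slope equals the difference between the two weighted means.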

To obtain a 95% confidence interval around the point estimate $\hat{\theta}_1 = 3.4$ 
we need a method that takes the IP weighting into account. One possibility 
is to use statistical theory to derive the corresponding variance estimator. 
This approach requires the data analyst to program the estimator, which 
is not generally available in standard statistical software. A second possibility 
is to approximate the variance by nonparametric bootstrapping (see Technical 
Point 13.1). This approach requires appropriate computing resources, or 
lots of patience, for large databases. A third possibility is to use the robust 
variance estimator (e.g., as used for GEE models with an independent working 
correlation) that is a standard option in most statistical software packages. 
The 95% confidence intervals based on the robust variance estimator are valid 
but, unlike the above analytic and bootstrap estimators, conservative: they 
cover the super-population parameter more than 95% of the time. The 
conservative 95% confidence interval around $\hat{\theta}_1$ was (2.4, 4.5). In this chapter, all 
confidence intervals for IP weighted estimates are conservative. If the model 
for Pr[A = 1|L] is misspecified, the estimates of $\theta_0$ and $\theta_1$ will be biased and, 
as we discussed in the previous chapter, the confidence intervals may cover 
the true values less than 95% of the time. 
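
As a rough sketch of the third possibility, the code below computes a robust (sandwich) variance for the weighted least squares fit while treating the weights as fixed. The specific form used here, the basic sandwich estimator without small-sample corrections, is an assumption of this sketch. Statistical packages may apply slightly different versions. The data and weights are simulated and hypothetical.

```python
import numpy as np

def wls_robust(y, a, w):
    """Weighted least squares fit of E[Y|A] = theta_0 + theta_1 * A with a
    robust (sandwich) variance that treats the weights as fixed."""
    X = np.column_stack([np.ones_like(y), a])
    bread = np.linalg.inv((X.T * w) @ X)            # (X' W X)^{-1}
    theta = bread @ ((X.T * w) @ y)
    residuals = y - X @ theta
    scores = X * (w * residuals)[:, None]           # one row per individual
    meat = scores.T @ scores                        # sum_i w_i^2 e_i^2 x_i x_i'
    covariance = bread @ meat @ bread
    return theta, np.sqrt(np.diag(covariance))

# Simulated stand-in data with hypothetical weights, for illustration only.
rng = np.random.default_rng(3)
n = 500
a = rng.binomial(1, 0.3, size=n).astype(float)
y = 2.0 + 3.0 * a + rng.normal(0.0, 5.0, size=n)
w = rng.uniform(1.0, 4.0, size=n)

theta, se = wls_robust(y, a, w)
lower, upper = theta[1] - 1.96 * se[1], theta[1] + 1.96 * se[1]
print(f"theta_1 = {theta[1]:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

This calculation ignores the fact that the weights are themselves estimated, which is the source of the conservativeness mentioned above. A bootstrap would instead repeat the whole procedure, including the estimation of the weights, in each resample.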


12.3 Stabilized IP weights 


The goal of IP weighting is to create a pseudo-population in which there is 
no association between the covariates L and treatment A. The IP weights 
$W^A = 1/f(A|L)$ simulate a pseudo-population in which all members of the 
study population are replaced by two copies of themselves. One copy receives 
treatment value A = 1 and the other copy receives treatment value A = 0. 
In Chapter 2 we showed how the original study population in Figure 2.1 was 
transformed into the pseudo-population in Figure 2.3. Note that the expected 
mean of the weights $W^A$ is 2 because, heuristically, in the pseudo-population 
all individuals are included both under treatment and under no treatment. 

The average causal effect in the treated subpopulation can be estimated 
by using IP weights in which the numerator is Pr[A = 1|L]. See 
Technical Point 4.1. 


However, there are other ways to create a pseudo-population in which A and 
L are independent. For example, consider a pseudo-population in which all 
individuals have a probability of receiving A = 1 equal to 0.5 and a probability 
of receiving A = 0 also equal to 0.5, regardless of their values of L. Such a 
pseudo-population is constructed by using IP weights 0.5/f(A|L). This 
pseudo-population would be of the same size as the study population and it 
would be algebraically equal to the pseudo-population of the previous paragraph 
if all weights are divided by 2. Hence, the expected mean of the weights 
0.5/f(A|L) is 1 and the effect estimate obtained in the pseudo-population 
created by weights 0.5/f(A|L) is equal to that obtained in the pseudo-population 
created by weights 1/f(A|L). (You can check this empirically by using the data 
in Figure 2.1, or see the proof in Technical Point 12.2.) The same goes for any 
other IP weights p/f(A|L) with 0 < p ≤ 1. The weights $W^A = 1/f(A|L)$ are just 
one particular example of IP weights with p = 1. 
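
The invariance of the effect estimate to the constant p can be verified with a few lines of code. The data below are hypothetical and are not those of Figure 2.1. The array `f_al` stands for each individual's f(A|L) and the function name `ip_weighted_difference` is introduced only for this sketch.

```python
import numpy as np

def ip_weighted_difference(y, a, w):
    """Difference of weighted means of y between a == 1 and a == 0."""
    treated, untreated = a == 1, a == 0
    mean_1 = np.sum(w[treated] * y[treated]) / np.sum(w[treated])
    mean_0 = np.sum(w[untreated] * y[untreated]) / np.sum(w[untreated])
    return mean_1 - mean_0

# Hypothetical data: f_al holds f(A|L), each individual's conditional
# probability of the treatment level actually received.
y = np.array([3.0, 5.0, 4.0, 1.0, 2.0, 0.0])
a = np.array([1, 1, 1, 0, 0, 0])
f_al = np.array([0.50, 0.25, 0.75, 0.50, 0.75, 0.25])

for p in (1.0, 0.5, 0.1):
    print(p, ip_weighted_difference(y, a, p / f_al))
```

The constant p cancels from the numerator and denominator of each weighted mean, so the three printed estimates coincide.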


Let us take our reasoning a step further. The key requirement for confound- 
ing adjustment is that, in the pseudo-population, the probability of treatment 
A does not depend on the confounders L. We can achieve this requirement 
by assigning treatment with the same probability p to everyone in the pseudo- 
population. But we can also achieve it by creating a pseudo-population in 
which different people have different probabilities of treatment, as long as the 
probability of treatment does not depend on the value of L. For example, a 
common choice is to assign to the treated the probability of receiving treatment 
Pr[A = 1] in the original population, and to the untreated the probability of 
not receiving treatment Pr[A = 0] in the original population. Thus the IP 
weights are Pr[A = 1]/f(A|L) for the treated and Pr[A = 0]/f(A|L) for the 
untreated or, more compactly, f(A)/f(A|L). 


Figure 12.1 shows the pseudo-population that is created by the IP weights 
f(A)/f(A|L) when applied to the data in Figure 2.1, where Pr[A = 1] = 
13/20 = 0.65 and Pr[A = 0] = 7/20 = 0.35. Under the identifiability 
conditions of Chapter 3, the pseudo-population resembles a hypothetical randomized 
experiment in which 65% of the individuals in the study population have been 
randomly assigned to A = 1, and 35% to A = 0. Note that, to preserve 
the 65/35 ratio, the number of individuals in each branch cannot be integers. 
Fortunately, non-whole people are no big deal in mathematics. 
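
To see where the non-whole people come from, the sketch below applies the weights f(A)/f(A|L) to a population of 20 individuals with one dichotomous covariate. Only the totals of 13 treated and 7 untreated come from the text. The split of those individuals across levels of L in `counts` is an assumption made for illustration and should be checked against Figure 2.1.

```python
import numpy as np

# Assumed layout of 20 individuals by (L, A); only the marginal totals
# 13 treated and 7 untreated are taken from the text.
counts = {(0, 0): 4, (0, 1): 4, (1, 0): 3, (1, 1): 9}

L = np.concatenate([np.full(n, l) for (l, _), n in counts.items()])
A = np.concatenate([np.full(n, a) for (_, a), n in counts.items()])

pr_a1 = A.mean()                                           # Pr[A = 1] = 13/20
pr_a1_given_l = np.array([A[L == l].mean() for l in (0, 1)])[L]

f_a = np.where(A == 1, pr_a1, 1.0 - pr_a1)                 # f(A)
f_a_given_l = np.where(A == 1, pr_a1_given_l, 1.0 - pr_a1_given_l)
sw = f_a / f_a_given_l                                     # stabilized weights

for l in (0, 1):
    treated = sw[(L == l) & (A == 1)].sum()
    total = sw[L == l].sum()
    print(f"L = {l}: {treated:.1f} of {total:.1f} pseudo-individuals treated")
print("pseudo-population size:", sw.sum())
```

Within each level of L, 65% of the pseudo-population is treated, so A and L are independent, and the pseudo-population has the same size as the original 20 individuals.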


In our smoking cessation example, the IP weights f(A)/f(A|L) range from 
0.33 to 4.30, whereas the IP weights 1/f(A|L) range from 1.05 to 16.7. The 
stabilizing factor f(A) in the numerator is responsible for the narrower range 
of the f(A)/f(A|L) weights. The IP weights $W^A = 1/f(A|L)$ are referred to 
as nonstabilized weights, and the IP weights $SW^A = f(A)/f(A|L)$ are referred 
to as stabilized weights. The mean of the stabilized weights is expected to be 1 
because the size of the pseudo-population equals that of the study population. 

Figure 12.1 

In data analyses one should always check that the estimated weights 
$SW^A$ have mean 1 (Hernán and Robins 2006a). Deviations from 1 
indicate model misspecification or possible violations, or near violations, 
of positivity. See Fine Point 12.2 for more on checking positivity. 

The estimated IP weights $SW^A$ ranged from 0.33 to 4.30, and their 
mean was 1.00. 

Let us now re-estimate the effect of quitting smoking on body weight by 
using the stabilized IP weights $SW^A$. First, we need an estimate of the con- 
ditional probability Pr[A = 1|L] to construct the denominator of the weights. 
We use the same logistic model we used in Section 12.2 to obtain a parametric 
estimate $\widehat{\Pr}[A = 1|L]$ for each of the 1566 individuals in the study population. 
Second, we need to estimate Pr[A = 1] for the numerator of the weights. We 
can obtain a nonparametric estimate by the ratio 403/1566 or, equivalently, 
by fitting a saturated logistic model for Pr[A = 1] with an intercept and no 
covariates. Finally, we estimate the causal difference $E[Y^{a=1}] - E[Y^{a=0}]$ by 
fitting the mean model $E[Y|A] = \theta_0 + \theta_1 A$ with individuals weighted by their 
estimated stabilized IP weights: $\widehat{\Pr}[A = 1]/\widehat{\Pr}[A = 1|L]$ for the quitters, and 
$(1 - \widehat{\Pr}[A = 1])/(1 - \widehat{\Pr}[A = 1|L])$ for the non-quitters. Under our assump- 
tions, we estimated that quitting smoking increases weight by $\hat{\theta}_1 = 3.4$ kg (95% 
confidence interval: 2.4, 4.5) on average. This is the same estimate we obtained 
earlier using the nonstabilized IP weights $W^A$ rather than the stabilized IP 
weights $SW^A$. 
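
The three steps above (denominator, numerator, weighted fit) can be sketched on simulated data. This is not the analysis of the NHEFS: the single dichotomous confounder `l`, the data-generating values, and the function name `weighted_difference` are assumptions of this sketch. With one dichotomous confounder, stratum-specific proportions are used in place of a logistic model because the two coincide when the model is saturated.

```python
import numpy as np

def weighted_difference(y, a, w):
    """theta_1 from the weighted fit of E[Y|A] = theta_0 + theta_1 * A."""
    m1 = np.sum(w[a == 1] * y[a == 1]) / np.sum(w[a == 1])
    m0 = np.sum(w[a == 0] * y[a == 0]) / np.sum(w[a == 0])
    return m1 - m0

# Simulated stand-in data (NOT the NHEFS) with one dichotomous confounder.
rng = np.random.default_rng(4)
n = 1566
l = rng.binomial(1, 0.4, size=n)
a = rng.binomial(1, np.where(l == 1, 0.4, 0.2))
y = 2.0 + 3.0 * a - 1.5 * l + rng.normal(0.0, 4.0, size=n)

# Denominator: estimated Pr[A = 1 | L]; with a single dichotomous L, stratum
# proportions coincide with the fit of a saturated logistic model.
p_l = np.array([a[l == 0].mean(), a[l == 1].mean()])[l]
# Numerator: estimated Pr[A = 1], the overall proportion treated.
p = a.mean()

w = np.where(a == 1, 1.0 / p_l, 1.0 / (1.0 - p_l))          # nonstabilized
sw = np.where(a == 1, p / p_l, (1.0 - p) / (1.0 - p_l))     # stabilized

print("mean of stabilized weights:", sw.mean())
print("estimate with W: ", weighted_difference(y, a, w))
print("estimate with SW:", weighted_difference(y, a, sw))
```

The stabilized weights average to 1 and, because the outcome model is saturated, both sets of weights return the same point estimate.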


If nonstabilized and stabilized IP weights result in the same estimate, why 
use stabilized IP weights then? Because stabilized weights typically result in 
narrower 95% confidence intervals than nonstabilized weights. However, the 
statistical superiority of the stabilized weights can only occur when the (IP 
weighted) model is not saturated. In our above example, the two-parameter 
model $E[Y|A] = \theta_0 + \theta_1 A$ was saturated because treatment A could only take 
2 possible values. In many settings (e.g., time-varying or continuous treat- 
ments), the weighted model cannot possibly be saturated and therefore stabi- 
lized weights are used. The next section describes the use of stabilized weights 
for a continuous treatment. 

Fine Point 12.2 


Checking positivity. In our study, there are 4 white women aged 66 years and none of them quit smoking. That is, the 
probability of A = 1 conditional on (a subset of) L is 0. Positivity, a condition for IP weighting, is empirically violated. 
There are two possible ways in which positivity can be violated: 


• Structural violations: The type of violations described in Chapter 3. Individuals with certain values of L cannot 
possibly be treated (or untreated). An example: when estimating the effect of exposure to certain chemicals on 
mortality, being off work is an important confounder because people off work are more likely to be sick and to die, 
and a determinant of chemical exposure—people can only be exposed to the chemical while at work. That is, the 
structure of the problem guarantees that the probability of treatment conditional on being off work is exactly 0 
(a structural zero). We'll always find zero cells when conditioning on that confounder. 


• Random violations: The type of violations described in the first paragraph of this Fine Point. Our sample is finite 
so, if we stratify on several confounders, we will start finding zero cells at some places even if the probability 
of treatment is not really zero in the target population. This is a random, not structural, violation of positivity 
because the zeroes appear randomly at different places in different samples of the target population. An example: 
our study happened to include 0 treated individuals in the strata “white women age 66” and “white women age 
67”, but it included a positive number of treated individuals in the strata “white women age 65” and “white 
women age 69.” 


Each type of positivity violation has different consequences. In the presence of structural violations, causal inferences 
cannot be made about the entire population using IP weighting or standardization. The inference needs to be restricted 
to strata in which structural positivity holds. See Technical Point 12.1 for details. In the presence of random violations, 
we used our parametric model to estimate the probability of treatment in the strata with random zeroes using data 
from individuals in the other strata. In other words, we use parametric models to smooth over the zeroes. For example, 
the logistic model used in Section 12.2 estimated the probability of quitting in white women aged 66 by interpolating 
from all other individuals in the study. Every time we use parametric estimation of IP weights in the presence of zero
cells, as we did in estimating $\hat{\theta}_1 = 3.4$, we are effectively assuming random nonpositivity.
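
The check described in this Fine Point can be sketched in a few lines. This is a minimal illustration, assuming a handful of made-up records; the names `records` and `zero_cells` are hypothetical and the data are not the NHEFS data. Note that a count of zero cannot by itself tell a structural violation from a random one; that judgment requires subject-matter knowledge.

```python
from collections import defaultdict

# Hypothetical records: (stratum of L, treatment A). Not the NHEFS data.
records = [
    (("white", "female", 65), 1), (("white", "female", 65), 0),
    (("white", "female", 66), 0), (("white", "female", 66), 0),
    (("white", "female", 66), 0), (("white", "female", 66), 0),
    (("white", "female", 69), 1), (("white", "female", 69), 0),
]

def zero_cells(rows):
    """Return strata of L in which some treatment level was never observed."""
    counts = defaultdict(lambda: {0: 0, 1: 0})
    for stratum, a in rows:
        counts[stratum][a] += 1
    return {s: c for s, c in counts.items() if c[0] == 0 or c[1] == 0}

empty = zero_cells(records)
for stratum, c in sorted(empty.items()):
    print(stratum, "treated:", c[1], "untreated:", c[0])
```

In this toy sample only the stratum of white women aged 66 has an empty treated cell, mirroring the example in the text.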


12.4 Marginal structural models 


Consider the following linear model for the mean outcome under treatment
level a:

$E[Y^a] = \beta_0 + \beta_1 a$

This is a (saturated) marginal structural mean model for a dichotomous treatment A.

This model is different from all models we have considered so far: the outcome
variable of this model is counterfactual, and hence generally unobserved.
Therefore the model cannot be fit to the data of any real-world study. Models 
for the marginal mean of a counterfactual outcome are referred to as marginal 
structural mean models. 

The parameters for treatment in structural mean models correspond to
average causal effects. In the above model, the parameter $\beta_1$ is equal to
$E[Y^{a=1}] - E[Y^{a=0}]$ because $E[Y^a] = \beta_0$ under a = 0 and $E[Y^a] = \beta_0 + \beta_1$
under a = 1. In previous sections, we have estimated the average causal effect
of smoking cessation A on weight change Y defined as $E[Y^{a=1}] - E[Y^{a=0}]$.
In other words, we have estimated the parameter $\beta_1$ of a marginal structural
model.

Specifically, we used IP weighting to construct a pseudo-population, and
then fit the model $E[Y|A] = \theta_0 + \theta_1 A$ to the pseudo-population data by using
IP weighted least squares. Under our assumptions, association is causation
in the pseudo-population. That is, the parameter $\theta_1$ from the IP weighted
associational model $E[Y|A] = \theta_0 + \theta_1 A$ can be endowed with the same causal
interpretation as the parameter $\beta_1$ from the structural model $E[Y^a] = \beta_0 +
\beta_1 a$. It follows that a consistent estimator $\hat{\theta}_1$ of the associational parameter
in the pseudo-population is also a consistent estimator of the causal effect
$\beta_1 = E[Y^{a=1}] - E[Y^{a=0}]$ in the population.

A desirable property of marginal structural models is null preservation (see Chapter 9): when the null hypothesis of no average causal effect is true, a marginal structural model is never misspecified. For example, under marginal structural model $E[Y^a] = \beta_0 + \beta_1 a + \beta_2 a^2$, a Wald test on two degrees of freedom of the joint hypothesis $\beta_1 = \beta_2 = 0$ is a valid test of the null hypothesis.

A (nonsaturated) marginal structural mean model for a continuous treatment A.

CODE: Program 12.4

The estimated $SW^A$ ranged from 0.19 to 5.10 with mean 1.00. We assumed constant variance (homoscedasticity), which seemed reasonable after inspecting a residuals plot. Other choices of distribution (e.g., truncated normal with heteroscedasticity) resulted in similar estimates.

The marginal structural model $E[Y^a] = \beta_0 + \beta_1 a$ is saturated because
smoking cessation A is a dichotomous treatment. That is, the model has 2
unknowns on both sides of the equation: $E[Y^{a=1}]$ and $E[Y^{a=0}]$ on the left-hand
side, and $\beta_0$ and $\beta_1$ on the right-hand side. Thus sample averages computed
in the pseudo-population were enough to estimate the causal effect of interest.
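
To make the last point concrete, here is a minimal sketch of how weighted sample averages yield the two parameters of a saturated model. It assumes a tiny made-up data set with a single dichotomous confounder and estimates $\Pr[A = a|L = l]$ by counting rather than with a logistic model; the names `data`, `weight` and `weighted_mean` are hypothetical.

```python
from collections import Counter

# Hypothetical data: (L, A, Y) triples with one dichotomous confounder L.
data = [
    (0, 0, 1.0), (0, 0, 3.0), (0, 0, 2.0), (0, 1, 4.0),
    (1, 0, 2.0), (1, 1, 6.0), (1, 1, 5.0), (1, 1, 7.0),
]

n_l = Counter(l for l, _, _ in data)
n_la = Counter((l, a) for l, a, _ in data)

def weight(l, a):
    """Nonstabilized IP weight 1 / Pr[A = a | L = l], estimated by counting."""
    return n_l[l] / n_la[(l, a)]

def weighted_mean(a):
    """Mean of Y among individuals with A = a in the pseudo-population."""
    num = sum(weight(l, ai) * y for l, ai, y in data if ai == a)
    den = sum(weight(l, ai) for l, ai, _ in data if ai == a)
    return num / den

theta0 = weighted_mean(0)            # intercept of the saturated weighted model
theta1 = weighted_mean(1) - theta0   # slope: estimate of E[Y^{a=1}] - E[Y^{a=0}]
print(theta0, theta1)
```

Because the model is saturated, weighted least squares would return the same two numbers as these weighted averages.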

But treatments are often polytomous or continuous. For example, consider 
the new treatment A “change in smoking intensity” defined as number of ciga- 
rettes smoked per day in 1982 minus number of cigarettes smoked per day at 
baseline. Treatment A can now take many values, e.g., —25 if an individual 
decreased his number of daily cigarettes by 25, 40 if an individual increased 
his number of daily cigarettes by 40. Let us say that we are interested in 
estimating the difference in average weight change under different changes in 
treatment intensity in the 1162 individuals who smoked 25 or fewer cigarettes 
per day at baseline. That is, we want to estimate $E[Y^a] - E[Y^{a'}]$ for any values
a and a'.

Because treatment A can take dozens of values, a saturated model with 
as many parameters becomes impractical. We will have to consider a non- 
saturated structural model to specify the dose-response curve for the effect of 
treatment A on the mean outcome Y. If we believe that a parabola appropri- 
ately describes the dose-response curve, then we would propose the marginal 
structural model 


$E[Y^a] = \beta_0 + \beta_1 a + \beta_2 a^2$

where $a^2 = a \times a$ is a-squared and $E[Y^{a=0}] = \beta_0$ is the average weight gain
under a = 0, i.e., under no change in smoking intensity between baseline and
1982.

Suppose we want to estimate the average causal effect of increasing smoking
intensity by 20 cigarettes per day compared with no change, i.e., $E[Y^{a=20}] -
E[Y^{a=0}]$. According to our structural model, $E[Y^{a=20}] = \beta_0 + 20\beta_1 + 400\beta_2$,
and thus $E[Y^{a=20}] - E[Y^{a=0}] = 20\beta_1 + 400\beta_2$. Now we need to estimate the
parameters $\beta_1$ and $\beta_2$. To do so, we need to estimate IP weights $SW^A$ to
create a pseudo-population in which there is no confounding by L, and then
fit the associational model $E[Y|A] = \theta_0 + \theta_1 A + \theta_2 A^2$ to the pseudo-population
data.

To estimate the stabilized weights $SW^A = f(A)/f(A|L)$ we need to estimate
$f(A|L)$. For a dichotomous treatment A, $f(A|L)$ is a probability so
we used a logistic model to estimate $\Pr[A = 1|L]$. For a continuous treatment
A, $f(A|L)$ is a probability density function (PDF). Unfortunately, PDFs
are generally hard to estimate correctly, which is why using IP weighting for
continuous treatments will often be dangerous. In our example, we assumed
that the density $f(A|L)$ was normal (Gaussian) with mean $\mu_L = E[A|L]$ and
constant variance $\sigma^2$. We then used a linear regression model to estimate the
mean $E[A|L]$ and variance of residuals $\sigma^2$ for all combinations of values of L.
We also assumed that the density $f(A)$ in the numerator was normal. One
should be careful when using IP weighting for continuous treatments because
the effect estimates may be exquisitely sensitive to the choice of the model for
the conditional density $f(A|L)$.
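
The weight construction just described can be sketched as follows. This is an illustration under stated assumptions, not the book's Program 12.4: the data are simulated, there is a single confounder `L`, and both densities are taken to be normal with constant variance. All variable names are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical simulated data: one confounder L and a continuous treatment A.
n = 2000
L = [random.gauss(0.0, 1.0) for _ in range(n)]
A = [2.0 + 0.5 * l + random.gauss(0.0, 2.0) for l in L]

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mean(xs):
    return sum(xs) / len(xs)

# Denominator model: linear regression of A on L with constant residual variance.
mL, mA = mean(L), mean(A)
slope = sum((l - mL) * (a - mA) for l, a in zip(L, A)) / sum((l - mL) ** 2 for l in L)
intercept = mA - slope * mL
resid_sd = math.sqrt(mean([(a - intercept - slope * l) ** 2 for l, a in zip(L, A)]))

# Numerator model: marginal normal density of A.
marg_sd = math.sqrt(mean([(a - mA) ** 2 for a in A]))

# Stabilized weight SW^A = f(A) / f(A|L) for every individual.
sw = [normal_pdf(a, mA, marg_sd) / normal_pdf(a, intercept + slope * l, resid_sd)
      for l, a in zip(L, A)]
print(round(mean(sw), 2), round(min(sw), 2), round(max(sw), 2))
```

Inspecting the mean and range of the estimated weights, as the margin note does for the real data, is a quick diagnostic: a mean far from 1 or very large weights suggest a misspecified density model or near-violations of positivity.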

Our IP weighted estimates of the parameters of the marginal structural
model were $\hat{\beta}_0 = 2.005$, $\hat{\beta}_1 = -0.109$, and $\hat{\beta}_2 = 0.003$. According to these
estimates, the mean weight gain (95% confidence interval) would have been 2.0
kg (1.4, 2.6) if all individuals had kept their smoking intensity constant, and
0.9 kg (-1.7, 3.5) if all individuals had increased smoking by 20 cigarettes/day
between baseline and 1982.

One can also consider a marginal structural model for a dichotomous outcome.
For example, if interested in the causal effect of quitting smoking A (1:
yes, 0: no) on the risk of death D (1: yes, 0: no) by 1992, one could consider
a marginal structural logistic model like

$\mathrm{logit} \Pr[D^a = 1] = \alpha_0 + \alpha_1 a$

where $\exp(\alpha_1)$ is the causal odds ratio of death for quitting versus not quitting
smoking. The parameters of this model are consistently estimated, under our
assumptions, by fitting the logistic model $\mathrm{logit} \Pr[D = 1|A] = \theta_0 + \theta_1 A$ to
the pseudo-population created by IP weighting. We estimated the causal odds
ratio to be $\exp(\hat{\theta}_1) = 1.0$ (95% confidence interval: 0.8, 1.4).

This is a saturated marginal structural logistic model for a dichotomous treatment. For a continuous treatment, we would specify a nonsaturated logistic model.


12.5 Effect modification and marginal structural models 


The parameter $\beta_3$ does not generally have a causal interpretation as the effect of V. Remember that we are assuming exchangeability, positivity, and consistency for treatment A, not for sex V!


Marginal structural models do not include covariates when the target parame- 
ter is the average causal effect in the population. However, one may include 
covariates—which may be non-confounders—in a marginal structural model to 
assess effect modification. Suppose it is hypothesized that the effect of smoking 
cessation varies by sex V (1: woman, 0: man). To examine this hypothesis, 
we add the covariate V to our marginal structural mean model: 


$E[Y^a|V] = \beta_0 + \beta_1 a + \beta_2 V a + \beta_3 V$


Additive effect modification is present if $\beta_2 \neq 0$. Technically, this is not a marginal
model any more, because it is conditional on V, but the term “marginal
structural model” is still applied.

We can estimate the model parameters by fitting the linear regression model
$E[Y|A, V] = \theta_0 + \theta_1 A + \theta_2 V A + \theta_3 V$ via weighted least squares with IP weights
$W^A$ or $SW^A$. The vector of covariates L needs to include V, even if V is not a
confounder, and any other variables that are needed to ensure exchangeability
within levels of V.

Because we are considering a model for the effect of treatment within levels
of V, we now have the choice to use either $f[A]$ or $f[A|V]$ in the numerator
of the stabilized weights. IP weighting based on the stabilized weights
$SW^A(V) = f[A|V] / f[A|L]$ generally results in narrower confidence intervals around
the effect estimates. Some intuition for the increased statistical efficiency of
$SW^A(V)$: with V in the conditioning event of both the numerator and the
denominator, the numerical values of numerator and denominator get closer,
which results in added stabilization for (less variability in) the IP weights and
therefore narrower 95% confidence intervals. We estimate $SW^A(V)$ using the
same approach as for $SW^A$, except that we add the covariate V to the logistic
model for the numerator of the weights.
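
The intuition about added stabilization can be checked numerically. The sketch below assumes made-up cell counts for a dichotomous treatment A and a covariate vector L = (V, X), and estimates every probability by counting instead of with logistic models; the names `cells`, `sw_a` and `sw_av` are hypothetical.

```python
from collections import Counter

# Hypothetical cell counts for (V, X, A); L = (V, X). Not the NHEFS data.
cells = {(0, 0, 0): 6, (0, 0, 1): 2, (0, 1, 0): 4, (0, 1, 1): 4,
         (1, 0, 0): 2, (1, 0, 1): 6, (1, 1, 0): 1, (1, 1, 1): 7}
rows = [key for key, count in cells.items() for _ in range(count)]
n = len(rows)

n_a = Counter(a for _, _, a in rows)
n_v = Counter(v for v, _, _ in rows)
n_va = Counter((v, a) for v, _, a in rows)
n_l = Counter((v, x) for v, x, _ in rows)
n_la = Counter(rows)

def variance(ws):
    m = sum(ws) / len(ws)
    return sum((w - m) ** 2 for w in ws) / len(ws)

# The denominator f[A|L] is shared; only the numerator differs.
sw_a = [(n_a[a] / n) / (n_la[(v, x, a)] / n_l[(v, x)]) for v, x, a in rows]
sw_av = [(n_va[(v, a)] / n_v[v]) / (n_la[(v, x, a)] / n_l[(v, x)]) for v, x, a in rows]

print("mean SW^A    =", sum(sw_a) / n, " variance =", variance(sw_a))
print("mean SW^A(V) =", sum(sw_av) / n, " variance =", variance(sw_av))
```

Both sets of weights average 1 in this example, but the weights with V in the numerator vary much less, which is the source of the narrower confidence intervals.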

The particular subset V of L that an investigator chooses to include in the
marginal structural model should only reflect the investigator’s substantive interest.
For example, a variable V should be included in the marginal structural
model only if the investigator both believes that V may be an effect modifier
and has greater substantive interest in the causal effect of treatment within
levels of the covariate V than in the entire population. In our example, we
found no strong evidence of effect modification by sex as the 95% confidence
interval around the parameter estimate $\hat{\theta}_2$ was (-2.2, 1.9). If the investigator
chooses to include all variables L in the marginal structural model, then the
stabilized weights $SW^A(L)$ equal 1 and no IP weighting is necessary because
the (unweighted) outcome regression model, if correctly specified, fully adjusts
for all confounding by L (see Chapter 15). For this reason, in a slightly humorous
vein, we refer to a marginal structural model that conditions on all
variables L needed for exchangeability as a faux marginal structural model.

If we were interested in the interaction between 2 treatments A and B (as opposed to effect modification of treatment A by variable V; see Chapter 5), we would include parameters for both A and B in the marginal structural model, and would estimate IP weights with the joint probability of both treatments in the denominator. We would assume exchangeability, positivity, and consistency for A and B.

In Part I we discussed that effect modification and confounding are two 
logically distinct concepts. Nonetheless, many students have difficulty under- 
standing the distinction because the same statistical methods—stratification 
(Chapter 4) or regression (Chapter 15)—are often used both for confounder ad- 
justment and detection of effect modification. Thus, there may be some advan- 
tage to teaching these concepts using marginal structural models, because then 
methods for confounder adjustment (IP weighting) are distinct from methods 
for detection of effect modification (adding treatment-covariate product terms 
to a marginal structural model). 


12.6 Censoring and missing data 


When estimating the causal effect of smoking cessation A on weight gain Y, 
we restricted the analysis to the 1566 individuals with a body weight mea- 
surement at the end of follow-up in 1982. There were, however, 63 additional 
individuals who met our eligibility criteria but were excluded from the analysis 
because their weight in 1982 was not known. Selecting only individuals with 
nonmissing outcome values—that is, censoring from the analysis those with 
missing values—may introduce selection bias, as discussed in Chapter 8. 

Let censoring C be an indicator for measurement of body weight in 1982: 
1 if body weight is unmeasured (i.e., the individual is censored), and 0 if 
body weight is measured (i.e., the individual is uncensored). Our analysis 
was necessarily restricted to uncensored individuals, i.e., those with C = 0, 
because those were the only ones with known values of the outcome Y. That 
is, in sections 12.2 and 12.4 we did not fit the (weighted) outcome regression
model $E[Y|A] = \theta_0 + \theta_1 A$, but rather the model $E[Y|A, C = 0] = \theta_0 + \theta_1 A$
restricted to individuals with C = 0.

Unfortunately, even under the null, selecting only uncensored individuals 
for the analysis is expected to induce bias when C is either a collider on a
pathway between treatment A and the outcome Y, or the descendant of one 
such collider. See the causal diagrams in Figures 8.3 to 8.6. Our data are 
consistent with the structure depicted by those causal diagrams: treatment A 
is associated with censoring C—5.8% of quitters versus 3.2% nonquitters were 
censored—and at least some predictors of Y are associated with C—the average 
baseline weight was 76.6 kg in the censored versus 70.8 in the uncensored. 

Because censoring due to loss to follow-up can introduce selection bias, we
are generally interested in the causal effect if nobody in the study population
had been censored. In our example, the goal becomes estimating the mean
weight gain if everybody had quit smoking and nobody’s outcome had been
censored, $E[Y^{a=1,c=0}]$, and the mean weight gain if nobody had quit smoking
and nobody’s outcome had been censored, $E[Y^{a=0,c=0}]$. Then the causal effect
of interest is $E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]$, a joint effect of A and C as we discussed
in Chapter 8. The use of the superscript c = 0 makes explicit the
causal contrast that many have in mind when they refer to the causal effect of
treatment A, even if they choose not to use the superscript c = 0.

The IP weights for censoring and treatment are $W^{A,C} = 1/f(A, C = 0|L)$, where the joint density of A and C is factored as $f(A, C = 0|L) = f(A|L) \times \Pr[C = 0|L, A]$.

Some variables in L may have zero coefficients in the model for $f(A|L)$ but not in the model for $\Pr[C = 0|L, A]$, or vice versa. Nonetheless, in large samples, it is always more efficient to keep all variables L that independently predict the outcome in both models.

The estimated IP weights $SW^C$ have mean 1 when the model for $\Pr[C = 0|A]$ is correctly specified. See Technical Point 12.2 for more on stabilized IP weights.

The estimated IP weights $SW^{A,C}$ ranged from 0.35 to 4.09, and their mean was 1.00.

This causal effect can be estimated by using IP weights $W^{A,C} = W^A \times W^C$
in which $W^C = 1/\Pr[C = 0|L, A]$ for the uncensored individuals and $W^C = 0$
for the censored individuals. The IP weights $W^{A,C}$ adjust for both confounding
and selection bias under the identifiability conditions of exchangeability for the
joint treatment (A, C) conditional on L (that is, $Y^{a,c=0} \perp\!\!\!\perp (A, C) | L$), joint
positivity for (A = a, C = 0), and consistency. If some of the variables in L are
affected by treatment A, e.g., as in Figure 8.4, the conditional independence
$Y^{a,c=0} \perp\!\!\!\perp (A, C) | L$ will not generally hold. In Part III we show that there are
alternative exchangeability conditions that license us to use IP weighting to
estimate the joint effect of A and C when some components of L are affected
by treatment.

Remember that the weights $W^C = 1/\Pr[C = 0|L, A]$ create a pseudo-population
with the same size as that of the original study population before
censoring, and in which there is no arrow from either L or A into C.
In our example, the estimates of IP weights for censoring $W^C$ will create a
pseudo-population with (approximately) 1566 + 63 = 1629 individuals in which, under our
assumptions, there is no selection bias because there is no selection. That is,
we fit the weighted model $E[Y|A, C = 0] = \theta_0 + \theta_1 A$ with weights $W^{A,C}$ to
estimate the parameters of the marginal structural model $E[Y^{a,c=0}] = \beta_0 + \beta_1 a$
in the entire population.

Alternatively, one can use stabilized IP weights $SW^{A,C} = SW^A \times SW^C$.
The censoring weights $SW^C = \Pr[C = 0|A] / \Pr[C = 0|L, A]$ create a pseudo-population
of the same size as the original study population after censoring,
and in which there is no arrow from L into C. In our example, the estimates
of IP weights for censoring $SW^C$ will create a pseudo-population of (approximately)
1566 uncensored individuals. That is, the stabilized weights do not
eliminate censoring in the pseudo-population; they make censoring occur at
random with respect to the measured covariates L. Therefore, under our assumption
of conditional exchangeability of censored and uncensored individuals
given L (and A), the proportion of censored individuals in the pseudo-population
is identical to that in the study population: there is selection but
no selection bias.
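
The difference in pseudo-population size between nonstabilized and stabilized censoring weights can be verified on a toy example. The sketch below assumes made-up counts of (L, A, C) and estimates the censoring probabilities by counting rather than with a logistic model; the names `cells`, `w_c` and `sw_c` are hypothetical.

```python
from collections import Counter

# Hypothetical counts of (L, A, C); C = 1 means the outcome is missing.
cells = {(0, 0, 0): 40, (0, 0, 1): 2, (0, 1, 0): 15, (0, 1, 1): 1,
         (1, 0, 0): 20, (1, 0, 1): 4, (1, 1, 0): 12, (1, 1, 1): 6}
rows = [key for key, count in cells.items() for _ in range(count)]

n_la = Counter((l, a) for l, a, _ in rows)
n_a = Counter(a for _, a, _ in rows)
unc_la = Counter((l, a) for l, a, c in rows if c == 0)
unc_a = Counter(a for _, a, c in rows if c == 0)

uncensored = [(l, a) for l, a, c in rows if c == 0]

# W^C = 1 / Pr[C = 0 | L, A], estimated by counting within (L, A) cells.
w_c = [n_la[la] / unc_la[la] for la in uncensored]
# SW^C = Pr[C = 0 | A] / Pr[C = 0 | L, A].
sw_c = [(unc_a[la[1]] / n_a[la[1]]) * (n_la[la] / unc_la[la]) for la in uncensored]

print(len(rows), len(uncensored))                # study population, uncensored
print(round(sum(w_c), 6), round(sum(sw_c), 6))   # pseudo-population sizes
```

The nonstabilized weights add up to the size of the full study population, while the stabilized weights add up to the number of uncensored individuals, which is the contrast between 1629 and 1566 in the text.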

To obtain parametric estimates of $\Pr[C = 0|L, A]$ in our example, we fit a
logistic regression model for the probability of being uncensored to the 1629
individuals in the study population. The model included the same covariates
we used earlier to estimate the weights for treatment. Under these parametric
restrictions, we obtained an estimate $\widehat{\Pr}[C = 0|L, A]$ and an estimate of
$SW^C$ for each of the 1566 uncensored individuals. Using the stabilized weights
$SW^{A,C} = SW^A \times SW^C$ we estimated that quitting smoking increases weight
by $\hat{\theta}_1 = 3.5$ kg (95% confidence interval: 2.5, 4.5) on average. This is almost the
same estimate we obtained earlier using IP weights $SW^A$, which suggests that
either there is no selection bias by censoring or that our measured covariates
are unable to eliminate it.

We now describe an alternative to IP weighting to adjust for confounding 
and selection bias: standardization.

Technical Point 12.2

More on stabilized weights. The stabilized weights $SW^A = \frac{f[A]}{f[A|L]}$ are part of the larger class of stabilized weights $\frac{g[A]}{f[A|L]}$, where $g[A]$ is any function of A that is not a function of L. When unsaturated structural models are used, weights $\frac{g[A]}{f[A|L]}$ are preferable over weights $\frac{1}{f[A|L]}$ because there exist functions $g[A]$ (often $f[A]$) that can be used to construct more efficient estimators of the causal effect in a nonsaturated marginal structural model. We now show that the IP weighted mean with weights $\frac{g[A]}{f[A|L]}$ is equal to the counterfactual mean $E[Y^a]$.

First note that the IP weighted mean $E\left[\frac{I(A=a)Y}{f(A|L)}\right]$ using weights $1/f[A|L]$, which is equal to $E[Y^a]$, can also be expressed as $\frac{E\left[\frac{I(A=a)Y}{f(A|L)}\right]}{E\left[\frac{I(A=a)}{f(A|L)}\right]}$ because $E\left[\frac{I(A=a)}{f(A|L)}\right] = 1$. Similarly, the IP weighted mean using weights $\frac{g[A]}{f[A|L]}$ can be expressed as $\frac{E\left[\frac{I(A=a)Y}{f(A|L)}g(A)\right]}{E\left[\frac{I(A=a)}{f(A|L)}g(A)\right]}$, which is also equal to $E[Y^a]$. The proof proceeds as in Technical Point 2.2 to show that the numerator $E\left[\frac{I(A=a)Y}{f(A|L)}g(A)\right] = E[Y^a]\,g(a)$, and that the denominator $E\left[\frac{I(A=a)}{f(A|L)}g(A)\right] = g(a)$.
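
For readers who want the intermediate steps for the numerator, one way to write them out is sketched below. It assumes a discrete treatment A and the three identifiability conditions (consistency, conditional exchangeability given L, and positivity); the layout is ours and may differ from Technical Point 2.2.

```latex
\begin{align*}
E\left[\frac{I(A=a)\,Y}{f(A|L)}\,g(A)\right]
  &= E\left[\frac{I(A=a)\,Y^{a}}{f(a|L)}\right] g(a)
     && \text{(consistency; } g(A)=g(a) \text{ when } A=a\text{)} \\
  &= E\left[\frac{E[I(A=a)\,|\,L]\;E[Y^{a}\,|\,L]}{f(a|L)}\right] g(a)
     && \text{(conditional exchangeability } Y^{a} \perp\!\!\!\perp A \,|\, L\text{)} \\
  &= E\big[E[Y^{a}\,|\,L]\big]\, g(a)
     && \text{(since } E[I(A=a)\,|\,L] = f(a|L) > 0 \text{ by positivity)} \\
  &= E[Y^{a}]\, g(a).
\end{align*}
```

Replacing Y by the constant 1 in the same steps gives the denominator, $g(a)$, so the ratio equals $E[Y^a]$.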


Chapter 13 


STANDARDIZATION AND THE PARAMETRIC G-FORMULA 


In this chapter we describe how to use standardization to estimate the average causal effect of smoking cessation 
on body weight gain. We use the same observational data set as in the previous chapter. Though standardization 
was introduced in Chapter 2, we only described it as a nonparametric method. We now describe the use of models 
together with standardization, which will allow us to tackle high-dimensional problems with many covariates and 
nondichotomous treatments. As in the previous chapter, we provide computer code to conduct the analyses. 

In practice, investigators will often have a choice between IP weighting and standardization as the analytic 
approach to obtain effect estimates from observational data. Both methods are based on the same identifiability 
conditions, but on different modeling assumptions. 


13.1 Standardization as an alternative to IP weighting 


As in the previous chapter, we will 
assume that the components of L 
required to adjust for C are unaf- 
fected by A. Otherwise, we would 
need to use the more general ap- 
proach described in Part III. 


In the previous chapter we estimated the average causal effect of smoking ces- 
sation A (1: yes, 0: no) on weight gain Y (measured in kg) using IP weighting. 
In this chapter we will estimate the same effect using standardization. Our 
analyses will also be based on NHEFS data from 1629 cigarette smokers aged 
25-74 years who had a baseline visit and a follow-up visit about 10 years later. 
Of these, 1566 individuals had their weight measured at the follow-up visit and 
are therefore uncensored (C = 0). 

We define $E[Y^{a,c=0}]$ as the mean weight gain that would have been observed
if all individuals had received treatment level a and if no individuals had been
censored. The average causal effect of smoking cessation can be expressed as
the difference $E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]$, that is, the difference in mean weight
that would have been observed if everybody had been treated and uncensored
compared with untreated and uncensored.

As shown in Table 12.1, quitters (A = 1) and non-quitters (A = 0) differ
with respect to the distribution of predictors of weight gain. The observed
associational difference $E[Y|A = 1, C = 0] - E[Y|A = 0, C = 0] = 2.5$ is
expected to differ from the causal difference $E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]$. Again
we assume that the vector of variables L is sufficient to adjust for confounding
and selection bias, and that L includes the baseline variables sex (0: male, 
1: female), age (in years), race (0: white, 1: other), education (5 categories), 
intensity and duration of smoking (number of cigarettes per day and years of 
smoking), physical activity in daily life (3 categories), recreational exercise (3 
categories), and weight (in kg). 

One way to adjust for the variables L is IP weighting, which creates a
pseudo-population in which the distribution of the variables in L is the same
in the treated and in the untreated. Then, under the assumptions of exchangeability
and positivity given L, we estimate $E[Y^{a,c=0}]$ by simply computing
$E[Y|A = a, C = 0]$ as the average outcome in the pseudo-population. If A
were a continuous treatment (contrary to our example), we would also need a
structural model to estimate $E[Y|A, C = 0]$ in the pseudo-population for all
possible values of A. IP weighting requires estimating the joint distribution of

Fine Point 13.1


Structural positivity. Lack of structural positivity precludes the identification of the average causal effect in the entire 
population when using IP weighting. Positivity is also necessary for standardization because, when $\Pr[A = a|L = l] = 0$
and $\Pr[L = l] \neq 0$, then the conditional mean outcome $E[Y|A = a, L = l]$ is undefined.

But the practical impact of deviations from positivity may vary greatly between IP weighted and standardized 
estimates that rely on parametric models. When using standardization, one can ignore the lack of positivity if one 
is willing to rely on parametric extrapolation. That is, one can fit a model for E[Y |A, L] that will smooth over the 
strata with structural zeroes. This smoothing will introduce bias into the estimation, and therefore the nominal 95% 
confidence intervals around the estimates will cover the true effect less than 95% of the time. Also, note the different 
purpose of modeling in this setting with structural positivity: we model not because we lack enough data, but because 
we want to estimate a quantity that cannot be identified even with an infinite amount of data (because of structural
non-positivity). This is an important distinction.

In general, in the presence of violations or near-violations of positivity, the standard error of the treatment effect will 
be smaller for standardization than for IP weighting. This does not necessarily mean that standardization is preferred 
over IP weighting; the difference in the biases may swamp the differences in standard errors. 





Technical Point 2.3 proves that, 
under conditional exchangeability, 
positivity, and consistency, the 
standardized mean in the treated 
equals the mean if everyone had 
been treated. The extension to cen- 
soring is trivial: just replace A = a 
by (A = a, C = 0) in the proof and 
definitions. 


The average causal effect in the 
treated can be estimated by stan- 
dardization as described in Techni- 
cal Point 4.1. One just needs to 
replace Pr[L = l] by Pr[L = l| A = 
1] in the expression to the right. 


treatment and censoring. For the dichotomous treatment smoking cessation, 
we estimated Pr [A = a, C = 0| L] and computed IP probability weights with 
this joint probability in the denominator. 

As discussed in Chapter 2, an alternative to IP weighting is standardization.
Under exchangeability and positivity conditional on the variables in L,
the standardized mean outcome in the uncensored treated is a consistent estimator
of the mean outcome if everyone had been treated and had remained
uncensored, $E[Y^{a=1,c=0}]$. Analogously, the standardized mean outcome in the
uncensored untreated is a consistent estimator of the mean outcome if everyone
had been untreated and had remained uncensored, $E[Y^{a=0,c=0}]$. See Fine Point
13.1 for a discussion of the relative impact of deviations from positivity in IP
weighting and in standardization.

To compute the standardized mean outcome in the uncensored treated, we 
first need to compute the mean outcomes in the uncensored treated in each 
stratum l of the confounders L, i.e., the conditional means E[Y|A = 1,C = 
0, L = l] in each of the strata l. In our smoking cessation example, we would 
need to compute the mean weight gain Y among those who quit smoking and 
remained uncensored in each of the (possibly millions of) strata defined by the 
combination of values of the 9 variables in L. 

The standardized mean in the uncensored treated is then the weighted 
average of these conditional means using as weights the prevalence of each 
value l in the study population, i.e., $\Pr[L = l]$. That is, the conditional mean
from the stratum with the greatest number of individuals has the greatest 
weight in the computation of the standardized mean. The standardized mean 
in the uncensored untreated is computed analogously except that the A = 1 in 
the conditioning event is replaced by A = 0. 

More compactly, the standardized mean in the uncensored who received 
treatment level a is 


$\sum_l E[Y|A = a, C = 0, L = l] \times \Pr[L = l]$

When, as in our example, some of the variables in L are continuous, one needs
to replace $\Pr[L = l]$ by the probability density function (PDF) $f_L[l]$, and the
above sum becomes an integral.

In general, the standardized mean of Y is written as $\int E[Y|A = a, C = 0, L = l]\, dF_L(l)$ where $F_L(\cdot)$ is the joint cumulative distribution function (CDF) of the random variables in L. When, as in this chapter, L is a vector of baseline covariates unaffected by treatment, we can average over the observed values of L to nonparametrically estimate this integral.

The next two sections describe how to estimate the conditional means of
the outcome Y and the distribution of the confounders L, the two types of
quantities required to estimate the standardized mean.


13.2 Estimating the mean outcome via modeling


Ideally, we would estimate the set of conditional means $E[Y|A = 1, C = 0, L = l]$
nonparametrically. We would compute the average outcome among the uncensored
treated in each of the strata defined by different combinations of values
of the variables L. This is precisely what we did in Section 2.3, where all the 
information required for this calculation was taken from Table 2.2. 

But nonparametric estimation of E[Y|A = 1,C = 0,L = l] is out of the 
question when, as in our current example, we have high-dimensional data with 
many confounders, some of them with multiple levels. We cannot obtain mean- 
ingful nonparametric stratum-specific estimates of the mean outcome in the 
treated when there are only 403 treated individuals distributed across millions 
of strata. We need to resort to modeling. The same rationale applies to the con- 
ditional mean outcome in the uncensored untreated E[Y|A = 0,C = 0, L = l]. 

To obtain parametric estimates of E[Y|A = a,C = 0, L = l] in each of the 
millions of strata defined by L, we fit a linear regression model for the mean 
weight gain with treatment A and all 9 confounders in L included as covariates. 
We used linear and quadratic terms for the (quasi-)continuous covariates age, 
weight, intensity and duration of smoking. That is, our model restricts the 
possible values of E[Y|A = a,C = 0, L = l] such that the conditional relation 
between the continuous covariates and the mean outcome can be represented 
by a parabolic curve. We included a product term between smoking cessation 
A and intensity of smoking. That is, our model imposes the restriction that 
each covariate’s contribution to the mean is independent of that of the other 
covariates, except that the contribution of smoking cessation A varies linearly 
with intensity of prior smoking.

Under these parametric restrictions, we obtained an estimate $\hat{E}[Y|A =
a, C = 0, L = l]$ for each combination of values of A and L, and therefore
for each of the 403 uncensored treated (A = 1, C = 0) and each of the 1163
uncensored untreated (A = 0, C = 0) individuals in the study population.
For example, we estimated that individuals with the combination of values
{non-quitter, male, white, age 26, college dropout, 15 cigarettes/day, 12 years
of smoking habit, moderate exercise, very active, weight 112 kg} had a mean
weight gain of 0.34 kg (the individual with unique identifier 24770 happened to
have this combination of values; you may take a look at his predicted value).
Overall, the mean of the estimated weight gain was 2.6 kg, same as the mean of
the observed weight gain, which ranged from -41.3 to 48.5 kg across different
combinations of covariates.
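
A scaled-down version of this outcome model can be sketched as follows. It is an illustration under stated assumptions rather than the model fit to the NHEFS data: the data are simulated, there is a single continuous covariate `X` standing in for smoking intensity, and the helper names `design` and `solve` are hypothetical. The model includes a linear and a quadratic term for the covariate and a product term with treatment.

```python
import random

random.seed(1)

# Hypothetical simulated data: treatment A, one continuous covariate X, outcome Y.
n = 500
X = [random.uniform(0.0, 4.0) for _ in range(n)]
A = [1 if random.random() < 0.3 else 0 for _ in range(n)]
Y = [1.0 + 3.0 * a + 0.5 * x - 0.1 * x * x + 0.2 * a * x + random.gauss(0.0, 1.0)
     for a, x in zip(A, X)]

def design(a, x):
    """Intercept, treatment, linear and quadratic covariate terms, product term."""
    return [1.0, a, x, x * x, a * x]

def solve(M, b):
    """Solve M beta = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(aug[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (aug[r][k] - tail) / aug[r][r]
    return beta

# Ordinary least squares through the normal equations (X'X) beta = X'y.
rows = [design(a, x) for a, x in zip(A, X)]
k = len(rows[0])
XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
Xty = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(k)]
beta = solve(XtX, Xty)

fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
print(sum(fitted) / n, sum(Y) / n)
```

The two printed means agree because least squares with an intercept forces the residuals to sum to zero. The agreement therefore says nothing about whether the model is correctly specified.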

Remember that our goal is to estimate the standardized mean $\sum_l E[Y|A =
a, C = 0, L = l] \times \Pr[L = l]$ in the treated (A = 1) and in the untreated (A = 0).
More formally, the standardized mean should be written as an integral because 
some of the variables in L are essentially continuous, and thus their distribution 
cannot be represented by a probability function. Regardless of these notational 
issues, we have already estimated the means E[Y|A = a,C = 0, L = l] for all 
values of treatment A and confounders L. 

The next step is standardizing these means to the distribution of the confounders
L for all values l.


13.3 Standardizing the mean outcome to the confounder distribution


The standardized mean is a weighted average of the conditional means $E[Y|A =
a, C = 0, L = l]$. When all variables in L are discrete, each mean receives a
weight equal to the proportion of individuals with values L = l, i.e., Pr [L = l]. 
In principle, these proportions Pr [L = l] could be calculated nonparametri- 
cally from the data: we would divide the number of individuals in the strata 
defined by L = l by the total number of individuals in the population. This is 
precisely what we did in Section 2.3, where all the information required for this 
calculation was taken from Table 2.2. However, this method becomes tedious 
for high-dimensional data with many confounders, some of them with multiple 
levels, as in our smoking cessation example. 

Fortunately, we do not need to estimate Pr[L = l]. We only need to estimate
E[Y|A = a, C = 0, L = l] for the l value of each individual i in the
study, and then compute the average (1/n) Σ_{i=1}^{n} Ê[Y|A = a, C = 0, L_i], where n is
the number of individuals in the study. This is so because the weighted mean
Σ_l E[Y|A = a, C = 0, L = l] Pr[L = l] can also be written as the double
expectation E[E[Y|A = a, C = 0, L]].
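
For discrete L, the equivalence between the average over individuals and the weighted mean can be seen by grouping the n individuals by their value of L. The short derivation below is only a sketch; the symbol n_l (the number of individuals with L_i = l) is notation introduced here for illustration and does not appear in the original text.

```latex
\frac{1}{n}\sum_{i=1}^{n} \hat{E}[Y \mid A=a, C=0, L_i]
  \;=\; \sum_{l} \frac{n_l}{n}\, \hat{E}[Y \mid A=a, C=0, L=l]
  \;=\; \sum_{l} \hat{E}[Y \mid A=a, C=0, L=l]\;\widehat{\Pr}[L=l],
\qquad \widehat{\Pr}[L=l] = \frac{n_l}{n}.
```

That is, averaging the estimated conditional means over individuals implicitly weights each stratum by its empirical proportion n_l/n, so Pr[L = l] never has to be estimated separately.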

We now describe a simple computational method to estimate the standardized
means Σ_l E[Y|A = a, C = 0, L = l] × Pr[L = l] in the treated (A = 1) and
in the untreated (A = 0) with many confounders, without ever explicitly estimating
Pr[L = l]. We first apply the method to the data in Table 2.2, in which
there was no censoring, the confounder L is only one variable with two levels,
and Y is a dichotomous outcome, i.e., the mean E[Y|A = a, C = 0, L = l] is the
risk Pr[Y = 1|A = a, L = l] of developing the outcome. Then we apply it to
the real data with censoring and many confounders. The method has four steps:
expansion of dataset, outcome modeling, prediction, and standardization by
averaging.

Table 2.2 has 20 rows, one per individual in the study. We now create a 
new dataset in which the data of Table 2.2 is copied three times. That is, the 
analytic dataset has 60 rows in three blocks of 20 individuals each. We leave 
the first block of 20 rows as is, i.e., the first block is identical to the data in 
Table 2.2. We modify the data of the second and third blocks as shown in the 
margin. In the second block, we set the value of A to 0 (untreated) for all 
20 individuals; in the third block we set the value of A to 1 (treated) for all 
individuals. In the second and third blocks, we delete the data on the outcome 
for all individuals, i.e., the variable Y is assigned a missing value. As described 
below, we will use the second block to estimate the standardized mean in the 
untreated and the third block for the standardized mean in the treated. 

Next we use the 3-block dataset to fit a regression model for the mean 
outcome given treatment A and the confounder L. We add a product term 
A × L to make the model saturated. Note that only the rows in the first
block of the dataset (the actual data) will contribute to the estimation of the 
parameters of the model because the outcome is missing for all rows in the 
second and third blocks. 

The next step is to use the parameter estimates from the first block to 
predict the outcome values for all rows in the second and third blocks. (That 
is, we combine the values of the columns L and A with the regression estimates 
to impute the missing value for the outcome Y.) The predicted outcome values 
for the second block are the mean estimates for each combination of values of L 
and A = 0, and the predicted values for the third block are the mean estimates
for each combination of values of L and A = 1.

CODE: Program 13.3

CODE: Program 13.4

Finally, we compute the average of all predicted values in the second block. 
Because 60% of rows have value L = 1 and 40% have value L = 0, this average 
gives more weight to rows with L = 1. That is, the average of all predicted 
values in the second block is precisely the standardized mean outcome in the 
untreated. We are done. To estimate the standardized mean outcome in the 
treated, we compute the average of all predicted values in the third block. 

The above procedure yields exactly the same estimates of the standardized 
means (0.5 for both of them) as the direct calculation in Section 2.3. Both 
approaches are completely nonparametric. In this chapter we did not directly
estimate the distribution of L, but rather averaged over the observed values of
L, i.e., its empirical distribution.
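
The four steps can be sketched in a few lines of Python. This is a minimal illustration and not the book's Program 13.3: the 20 rows below are a stand-in chosen to agree with what the text states about Table 2.2 (20 individuals, 60% with L = 1, standardized means of 0.5), and `make_rows` and `fit_saturated` are hypothetical helper names. Because the model with a product term is saturated, the sketch computes stratum means directly rather than calling a regression routine.

```python
# Stand-in for Table 2.2 (an assumption, see the text above): one (L, A, Y) row per individual.
def make_rows(l, a, n_events, n_total):
    """Return n_total rows with L=l and A=a, of which n_events have Y=1."""
    return [(l, a, 1)] * n_events + [(l, a, 0)] * (n_total - n_events)

observed = (
    make_rows(0, 0, 1, 4) + make_rows(0, 1, 1, 4)
    + make_rows(1, 0, 2, 3) + make_rows(1, 1, 6, 9)
)

# Step 1, expansion: block 1 is the data as is; blocks 2 and 3 set A to 0 and 1
# and blank out the outcome (None plays the role of a missing value).
block1 = list(observed)
block2 = [(l, 0, None) for (l, _, _) in observed]
block3 = [(l, 1, None) for (l, _, _) in observed]

# Step 2, outcome modeling: with binary L, binary A and a product term the model
# is saturated, so its fitted values are the stratum-specific means of Y.
def fit_saturated(rows):
    sums, counts = {}, {}
    for l, a, y in rows:
        if y is None:  # rows with a missing outcome do not contribute
            continue
        sums[(l, a)] = sums.get((l, a), 0) + y
        counts[(l, a)] = counts.get((l, a), 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

model = fit_saturated(block1 + block2 + block3)

# Step 3, prediction: impute the missing outcome in blocks 2 and 3.
pred2 = [model[(l, a)] for (l, a, _) in block2]
pred3 = [model[(l, a)] for (l, a, _) in block3]

# Step 4, standardization by averaging over each block.
std_untreated = sum(pred2) / len(pred2)
std_treated = sum(pred3) / len(pred3)
print(round(std_untreated, 3), round(std_treated, 3))
```

With these stand-in rows both averages equal 0.5 up to floating-point rounding, in line with the value reported in the text.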

The use of the empirical distribution for standardizing is the way to go in 
more realistic examples, like our smoking cessation study, with high-dimensional 
L. The procedure for our study is analogous to the one described above for 
the data in Table 2.2. We add the second and third blocks to the dataset, fit 
the regression model for E[Y|A = a, C = 0, L = l] as described in the previous 
section, and generate the predicted values. The average predicted value in the 
second block—the standardized mean in the untreated—was 1.66, and the aver- 
age predicted value in the third block—the standardized mean in the treated— 
was 5.18. Therefore, our estimate of the causal effect E[Y^{a=1,c=0}] − E[Y^{a=0,c=0}]
was 5.18 − 1.66 = 3.5 kg. To obtain a 95% confidence interval for this estimate
we used a statistical technique known as bootstrapping (see Technical Point 
13.1). In summary, we estimated that quitting smoking increases body weight 
by 3.5 kg (95% confidence interval: 2.6, 4.5). 


13.4 IP weighting or standardization? 


We have now described two ways in which modeling can be used to estimate 
the average causal effect of a treatment: IP weighting (previous chapter) and 
standardization (this chapter). In our smoking cessation example, both yielded 
almost exactly the same effect estimate. Indeed Technical Point 2.3 proved that 
the standardized mean equals the IP weighted mean. 
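
As a reminder of why the two quantities coincide, the following sketch reproduces the core of the argument for a discrete L and no censoring, assuming positivity so that f(a|l) = Pr[A = a|L = l] > 0. It is an outline only; the full statement is the one in Technical Point 2.3.

```latex
E\!\left[\frac{I(A=a)\,Y}{f(A \mid L)}\right]
  = \sum_{l} \Pr[L=l]\; \frac{1}{f(a \mid l)}\; E\big[I(A=a)\,Y \mid L=l\big]
  = \sum_{l} \Pr[L=l]\; \frac{1}{f(a \mid l)}\; E[Y \mid A=a, L=l]\;\Pr[A=a \mid L=l]
  = \sum_{l} E[Y \mid A=a, L=l]\;\Pr[L=l].
```

The treatment probability cancels exactly in the second step, which is why the equality holds when nothing is modeled.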

Why are we then bothering to estimate the standardized mean in this chap- 
ter if we had already estimated the IP weighted mean in the previous chapter? 
It turns out that the IP weighted and the standardized mean are only ex- 
actly equal when no models are used to estimate them. Otherwise they are 
expected to differ. To see this, consider the quantities that need to be mod- 
eled to implement either IP weighting or standardization. IP weighting mod- 
els Pr [A = a, C = 0|L], which we estimated in the previous chapter by fitting 
parametric logistic regression models for Pr [A = a|L] and Pr [C = 0|A =a, L]. 
Standardization models the conditional means E[Y|A = a, C = 0, L = l], which 
we estimated in this chapter using a parametric linear regression model. 

In practice some degree of misspecification is inescapable in all models, and 
model misspecification will introduce some bias. But the misspecification of 
the treatment model (IP weighting) and the outcome model (standardization) 
will not generally result in the same magnitude and direction of bias in the ef- 
fect estimate. Therefore the IP weighted estimate will generally differ from the 
standardized estimate because unavoidable model misspecification will affect 
the point estimates differently. Large differences between the IP weighted and 
standardized estimate will alert us to the presence of serious model misspecification
in at least one of the estimates. Small differences do not guarantee
absence of serious model misspecification, but will be reassuring: though logically
possible, it is unlikely that badly misspecified models would result in bias of
similar magnitude and direction for both methods.

Technical Point 13.1

Bootstrapping. Effect estimates are presented with measures of random variability, such as the standard error or the
95% confidence interval, which is a function of the standard error. (We discussed the foundations of variability in Chapter
10.) Because of the computational difficulty of obtaining exact estimates, in practice standard error estimates are often
based on large-sample approximations, which rely on asymptotic considerations. However, sometimes even large-sample
approximations are too complicated to be calculated. The bootstrap is an alternative method for estimating standard
errors and computing 95% confidence intervals. The simplest version of the bootstrap, which we used to compute the
95% confidence interval around the effect estimate of smoking cessation, is sketched below.

Take the study population of 1629 individuals. Sample with replacement 1629 individuals from the study population,
so that some of the original individuals may appear more than once while others may not be included at all. This new
sample of size 1629 is referred to as a “bootstrap sample.” Compute the effect of interest in the bootstrap sample (e.g.,
by using standardization as described in the main text). Now create a second bootstrap sample by again sampling with
replacement 1629 individuals. Compute the effect of interest in the second bootstrap sample using the same method
as for the first bootstrap sample. By chance, the first and second bootstrap samples will generally include a different
number of copies of each individual, and therefore will result in different effect estimates. Repeat the procedure in a
large number (say, 1000) of bootstrap samples. It turns out that the standard deviation of the 1000 effect estimates in
the bootstrap samples consistently estimates the standard error of the effect estimate in the study population. The 95%
confidence interval is then computed by using the usual normal approximation: the effect estimate ±1.96 times the
estimate of the standard error. See, for example, Wasserman (2004) for an introduction to the statistical theory underlying the bootstrap.

We used this bootstrap method with 1000 bootstrap samples to obtain the 95% confidence interval described in
the main text for the standardized mean difference. Though the bootstrap is a simple method, it can be computationally
intensive for very large datasets. It is therefore common to see published estimates that are based on only 200-500
bootstrap samples (which would have resulted in an almost identical confidence interval in our example). Finally, note
that the bootstrap is a general method for large samples. We could have also used it to compute a 95% confidence
interval for the IP weighted estimates from marginal structural models in the previous chapter.
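
The bootstrap procedure of Technical Point 13.1 can be sketched as follows. This is an illustration under stated assumptions, not the book's Program 13.4: the data are simulated with made-up parameter values rather than taken from NHEFS, the estimator uses stratum means for a single binary confounder instead of the parametric outcome model of this chapter, and `standardized_difference` and `bootstrap_ci` are hypothetical names.

```python
import random
import statistics

def standardized_difference(rows):
    """Standardized mean under A=1 minus standardized mean under A=0.

    rows holds (l, a, y) tuples with binary l and a. Stratum means act as a
    saturated outcome model, and averaging over rows standardizes to the
    empirical distribution of L. Assumes every (l, a) cell is non-empty.
    """
    sums, counts = {}, {}
    for l, a, y in rows:
        sums[(l, a)] = sums.get((l, a), 0.0) + y
        counts[(l, a)] = counts.get((l, a), 0) + 1
    mean = {cell: sums[cell] / counts[cell] for cell in sums}
    return sum(mean[(l, 1)] - mean[(l, 0)] for l, _, _ in rows) / len(rows)

def bootstrap_ci(rows, estimator, n_boot=1000, seed=1):
    """Point estimate, bootstrap standard error, and normal-approximation 95% CI."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample individuals with replacement, keeping the original sample size.
        sample = rng.choices(rows, k=len(rows))
        estimates.append(estimator(sample))
    se = statistics.stdev(estimates)  # SD of the bootstrap estimates
    point = estimator(rows)
    return point, se, (point - 1.96 * se, point + 1.96 * se)

# Simulated data with hypothetical parameter values (not the NHEFS data).
sim = random.Random(0)
data = []
for _ in range(400):
    l = 1 if sim.random() < 0.5 else 0
    a = 1 if sim.random() < (0.6 if l else 0.3) else 0
    y = 1.5 + 3.5 * a + 2.0 * l + sim.gauss(0, 4)
    data.append((l, a, y))

point, se, (lower, upper) = bootstrap_ci(data, standardized_difference)
print(f"estimate {point:.2f}, SE {se:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
```

The whole estimation procedure is repeated inside each bootstrap sample, so the interval reflects the variability of every step (model fitting, prediction, and averaging), not only the last one.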


In our smoking cessation example, both the IP weighted and the standard- 
ized estimates are similar. After rounding to one decimal place, the estimated 
weight gain due to smoking cessation was 3.5 kg regardless of whether we fit a 
model for treatment A (IP weighting) or for the outcome Y (standardization). 
In neither case did we fit a model for the confounders L, as we did not need the
distribution of the confounders to obtain the IP weighted estimate, and we just 
used the empirical distribution of L (a nonparametric method) to compute the 
standardized estimate. 


Both IP weighting and standardization are estimators of the g-formula, a 
general method for causal inference first described in 1986. (Part III provides 
a definition of the g-formula in settings with time-varying treatments.) We 
say that standardization is a plug-in g-formula estimator because it simply re- 
places the conditional mean outcome in the g-formula by its estimates. When, 
like in this chapter, those estimates come from parametric models, we refer to 
the method as the parametric g-formula. Because here we were only interested 
in the average causal effect, we only had to estimate the conditional mean 
outcome. More generally, the parametric g-formula uses estimates of any functions
of the distribution of the outcome (e.g., functionals like the probability
density function or PDF) within levels of A and L to compute its standardized
value. In the absence of time-varying confounders (see Part III), the parametric
g-formula does not require parametric modeling of the distribution of the
confounders.

Robins (1986) described the generalization of standardization to
time-varying treatments and confounders, and named it the
g-computation algorithm formula. Because this name is very long,
some authors have abbreviated it to g-formula and others to
g-computation. Even though g-formula and g-computation are synonyms,
this book uses only the former term because the latter
term is frequently confused with g-estimation, a different method
described in Chapter 14.

Fine Point 13.2

A doubly robust estimator. The previous chapter describes IP weighting, a method that requires a correct model for
treatment A conditional on the confounders L. This chapter describes standardization, a method that requires a correct
model for the outcome Y conditional on treatment A and the confounders L. How about a method that requires a
correct model for either treatment A or outcome Y? That is precisely what doubly robust estimation does. Under the
usual identifiability assumptions, a doubly robust estimator consistently estimates the causal effect if at least one of the
two models is correct (and one need not know which of the two models is correct). That is, doubly robust estimators
give us two chances to get it right.

There are many types of doubly robust estimators. Here we describe a doubly robust estimator proposed by Bang
and Robins (2005) for the average causal effect of a dichotomous treatment A on an outcome Y. For simplicity, we
consider a setting without censoring.

To obtain a doubly robust estimate of the average causal effect, first estimate the IP weight W^A = 1/f(A|L)
as described in the previous chapter. Then fit an outcome regression model like the one described in this chapter (a
generalized linear model with a canonical link) for E[Y|A = a, L = l, R] that adds the covariate R, where R = W^A if
A = 1 and R = −W^A if A = 0. Finally, use the predicted values from the outcome model to obtain the standardized
mean outcomes under A = 1 and A = 0. The difference of the standardized mean outcomes is now doubly robust.
That is, under exchangeability and positivity given L, this estimator consistently estimates the average causal effect if
either the model for the treatment or the model for the outcome is correct, without knowing which of the two models
is the correct one.

Often there is no need to choose between IP weighting and the parametric 
g-formula. When both methods can be used to estimate a causal effect, just 
use both methods. Also, whenever possible, use doubly robust methods (see 
Fine Point 13.2 and Technical Point 13.2) that combine models for treatment 
and for outcome in the same approach. 

Finally, note that we used the parametric g-formula to estimate the average 
causal effect in the entire population of interest. Had we been interested in 
the average causal effect in a particular subset of the population, we could 
have restricted our calculations to that subset. For example, if we had been 
interested in potential effect modification by sex, we would have estimated the 
standardized means in men and women separately. Both IP weighting and the 
parametric g-formula can be used to estimate average causal effects in either 
the entire population or a subset of it. 


13.5 How seriously do we take our estimates? 


We spent Part I of this book reviewing the definition of average causal ef- 
fect, the assumptions required to estimate it, and many potential biases. The 
discussion was purely conceptual, the data examples hypersimplistic. A key 
message was that a causal analysis of observational data is sharper when ex- 
plicitly emulating a (hypothetical) randomized experiment—the target trial. 
The analyses in this and the previous chapter are our first attempts at 
estimating causal effects from real data. Using both IP weighting and the 
parametric g-formula we estimated that the mean weight gain would have 
been 5.2 kg if everybody had quit smoking compared with 1.7 kg if nobody
had quit smoking. Both methods estimated that quitting smoking increases
weight by 3.5 kg (95% confidence interval: 2.5, 4.5) on average in this particular
population. In the next chapters we will see that similar estimates are obtained
when using g-estimation, outcome regression, and propensity scores.

Methods based on outcome regression (including doubly robust methods)
can be used in the absence of positivity, under the assumption
that the outcome model is correctly specified to extrapolate beyond the
data. See Fine Point 13.1.

This dependence of the numerical estimate on the exact interventions
is important when the estimates are used to guide decision making, e.g.,
in public policy or clinical medicine (Hernán 2016).

The validity of our causal inferences requires the following conditions:

• exchangeability
• positivity
• consistency
• no measurement error
• no model misspecification

The compatibility of estimates across methods is reassuring because each 
method’s estimate is based on different modeling assumptions. However, ob- 
servational effect estimates are always open to serious criticism. Even if we 
do not wish to transport our effect estimate to other populations (Chapter 4) 
and even if there is no interference between individuals, the validity of our es- 
timates for the target population requires many conditions. We classify these 
conditions in three groups. 

First, the identifiability conditions of exchangeability, positivity, and con- 
sistency (Chapter 3) need to hold for the observational study to resemble the 
target trial. The quitters and the non-quitters need to be exchangeable con- 
ditional on the 9 measured covariates L (see Fine Point 14.2). Unmeasured 
confounding (Chapter 7) or selection bias (Chapter 8, Fine Point 12.2) would 
prevent conditional exchangeability. Positivity requires that the distribution 
of the covariates L in the quitters fully overlaps with that in the non-quitters. 
Fine Point 13.1 discussed the different impact of deviations from positivity 
for nonparametric IP weighting and standardization. Regarding consistency, 
note that there are multiple versions of both quitting smoking (e.g., quitting 
progressively, quitting abruptly) and not quitting smoking (e.g., increasing in- 
tensity of smoking by 2 cigarettes per day, reducing intensity but not to zero). 
Our effect estimate corresponds to a somewhat vague hypothetical interven- 
tion in the target population that randomly assigns these versions of treatment 
with the same frequency as they actually have in the study population. Other 
hypothetical interventions might result in a different effect estimate. 

Second, all variables used in the analysis need to be correctly measured. 
Measurement error in the treatment A, the outcome Y, or the confounders L 
will generally result in bias (Chapter 9). 

Third, all models used in the analysis need to be correctly specified (Chap- 
ter 11). Suppose that the correct functional form for the continuous covariate 
age in the treatment model is not the parabolic curve we used but rather a 
curve represented by a complex polynomial. Then, even if all the confounders 
had been correctly measured and included in L, IP weighting would not fully 
adjust for confounding. Model misspecification has an effect similar to that of
measurement error in the confounders.

Ensuring that each of these conditions holds, at least approximately, is the
investigator’s most important task. If these conditions could be guaranteed 
to hold, then the data analysis would be trivial. The problem is, of course, 
that one cannot ever expect that any of these conditions will hold perfectly. 
Unmeasured confounders, nonoverlapping confounder distributions, ill-defined 
interventions, mismeasured variables, and misspecified models will typically 
lurk behind our estimates. Some of these problems may be addressed em- 
pirically, but others will remain a matter of subject-matter judgement, and 
therefore open to criticism that cannot be refuted by our data. For example, 
we can propose different model specifications but we cannot adjust for variables 
that were not measured. 

Causal inferences rely on the above conditions, which are heroic and not 
empirically testable. We replace the lack of data on the distribution of the 
counterfactual outcomes by the assumption that the above conditions are ap- 
proximately met. The more our study deviates from those conditions, the 
more biased our effect estimate may be. Therefore a healthy skepticism of
causal inferences drawn from observational data is necessary. In fact, a key 
step towards less casual causal inferences is the realization that the discussion 
should primarily revolve around each of the above assumptions. We only take 
our effect estimates as seriously as we take the conditions that are needed to 
endow them with a causal interpretation. 

Technical Point 13.2

The bias of doubly robust estimators. Suppose we have a dichotomous treatment A, an outcome Y, and a vector
of measured variables L that satisfy positivity and exchangeability (consistency is assumed). For simplicity we consider
estimation of the counterfactual mean outcome under treatment E[Y^{a=1}] rather than the average causal effect. Then
E[Y^{a=1}] can be written as either E[b(L)], where b(L) = E[Y|A = 1, L], or E[AY/π(L)], where π(L) = Pr[A = 1|L]. In
this chapter, we described a plug-in g-formula estimator (1/n) Σ_{i=1}^{n} b̂(L_i) that replaces the conditional mean outcome by its
estimate from a (say, linear) regression model for b(L) and averages over all n individuals in the study. In the previous
chapter, we described a Horvitz-Thompson IP weighted estimator (1/n) Σ_{i=1}^{n} A_i Y_i / π̂(L_i) that replaces the probability of treatment
by its estimate from a (say, logistic) regression model for π(L) and averages it over the n individuals. The bias of the
plug-in g-formula estimator will be large if the estimate b̂(L) is far from b(L), and the bias of the IP weighted estimator
will be large if π̂(L) is far from π(L).

A doubly robust estimator of E[Y^{a=1}] appropriately combines the estimate b̂(L) from the outcome model and the
estimate π̂(L) from the treatment model. There are many forms of doubly robust estimators, like the one described in
Fine Point 13.2 for the average causal effect. All doubly robust estimators involve a correction of the outcome regression
model by a function that involves the treatment model (the first-order influence function), which can also be viewed as
a correction of the Horvitz-Thompson estimator by a function that involves the outcome regression model. For example,
consider the following doubly robust estimator of E[Y^{a=1}]:

Ê[Y^{a=1}]_DR = (1/n) Σ_{i=1}^{n} [ b̂(L_i) + (A_i / π̂(L_i)) (Y_i − b̂(L_i)) ],

which can also be written as (1/n) Σ_{i=1}^{n} [ A_i Y_i / π̂(L_i) − (A_i / π̂(L_i) − 1) b̂(L_i) ].

Under exchangeability and positivity, the bias of this doubly robust estimator of E[Y^{a=1}] is small if either the
estimate b̂(L) is close to b(L) or the estimate π̂(L) is close to π(L). Specifically, the bias E[Ê[Y^{a=1}]_DR − E[Y^{a=1}]]
of Ê[Y^{a=1}]_DR in large samples is

E[ π(L) (1/π(L) − 1/π*(L)) (b*(L) − b(L)) ],


where π*(l) and b*(l) are the probability limits of π̂(l) and b̂(l), and π*(l) = π(l) when the treatment model is correct
and b*(l) = b(l) when the outcome model is correct. Thus the large sample (i.e., asymptotic) bias is zero when either
the outcome model or the treatment model is correct (and we do not need to know which one). Of course, one does
not expect any parametric model to be correctly specified if the vector L is very high-dimensional and thus even the
bias of our doubly robust estimator may be large.

However, all doubly robust estimators have the property that the bias depends on the product of the error 1/π(l) − 1/π̂(l)
in the estimation of 1/π(l) with the error b(l) − b̂(l) in the estimation of b(l). As we discuss in Chapter 18, this property,
which is known as second-order bias, allows us to construct doubly robust estimators of E[Y^{a=1}] that may have small
bias by estimating π(l) and b(l) with machine learning estimators rather than with standard parametric models. This is
because, in high-dimensional settings in which large amounts of data are available, machine learning estimators based
on complex algorithms produce more accurate estimates of π(l) and b(l) than standard parametric models, which have
a small number of parameters compared to the sample size.
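
The double robustness of the estimator (1/n) Σ [ b̂(L_i) + (A_i / π̂(L_i)) (Y_i − b̂(L_i)) ] can be checked numerically on a toy dataset. The sketch below is a finite-sample illustration, not a proof of consistency. The 20 rows are hypothetical, the "correct" nuisance estimates are the saturated (stratum-specific) ones, the "misspecified" ones deliberately ignore L, and all function and variable names are assumptions introduced for this example.

```python
def make_rows(l, a, n_events, n_total):
    """n_total rows with L=l and A=a, of which n_events have Y=1."""
    return [(l, a, 1)] * n_events + [(l, a, 0)] * (n_total - n_events)

# Hypothetical data: binary confounder L, treatment A, outcome Y.
rows = (make_rows(0, 0, 1, 4) + make_rows(0, 1, 1, 4)
        + make_rows(1, 0, 2, 3) + make_rows(1, 1, 6, 9))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

levels = (0, 1)
# Saturated ("correct") estimates of b(l) = E[Y|A=1, L=l] and pi(l) = Pr[A=1|L=l].
b_ok = {l: mean(y for l_i, a, y in rows if a == 1 and l_i == l) for l in levels}
pi_ok = {l: mean(a for l_i, a, _ in rows if l_i == l) for l in levels}
# Misspecified estimates: both ignore L altogether.
b_bad = dict.fromkeys(levels, mean(y for _, a, y in rows if a == 1))
pi_bad = dict.fromkeys(levels, mean(a for _, a, _ in rows))

def dr_mean(rows, b_hat, pi_hat):
    """Average of b_hat(L_i) + A_i / pi_hat(L_i) * (Y_i - b_hat(L_i))."""
    return mean(b_hat[l] + a / pi_hat[l] * (y - b_hat[l]) for l, a, y in rows)

for label, b_hat, pi_hat in [
    ("both correct", b_ok, pi_ok),
    ("outcome model wrong", b_bad, pi_ok),
    ("treatment model wrong", b_ok, pi_bad),
    ("both wrong", b_bad, pi_bad),
]:
    print(label, round(dr_mean(rows, b_hat, pi_hat), 3))
```

In this toy example the nonparametric standardized mean under treatment is 0.5. The estimator returns 0.5 (up to floating-point rounding) whenever at least one of the two nuisance estimates is the saturated one, and drifts to 7/13 ≈ 0.538 only when both ignore L.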


Chapter 14 
G-ESTIMATION OF STRUCTURAL NESTED MODELS 


In the previous two chapters, we described IP weighting and standardization to estimate the average causal effect 
of smoking cessation on body weight gain. In this chapter we describe a third method to estimate the average 
causal effect: g-estimation. We use the same observational NHEFS data and provide simple computer code to 
conduct the analyses. 

IP weighting, standardization, and g-estimation are often collectively referred to as g-methods because they 
are designed for application to generalized treatment contrasts involving treatments that vary over time. The 
application of g-methods to treatments that do not vary over time in Part II of this book may then be overkill 
since there are alternative, simpler approaches. However, by presenting g-methods in a relatively simple setting, 
we can focus on their main features while avoiding the more complex issues described in Part III. 

IP weighting and standardization were introduced in Part I (Chapter 2) and then described with models in 
Part II (Chapters 12 and 13, respectively). In contrast, we have waited until Part II to describe g-estimation. 
There is a reason for that: describing g-estimation is facilitated by the specification of a structural model, even if 
the model is saturated. Models whose parameters are estimated via g-estimation are known as structural nested 
models. The three g-methods are based on different modeling assumptions. 


14.1 The causal question revisited 


In the last two chapters we have applied IP weighting and standardization to
estimate the average causal effect of smoking cessation (the treatment) A on
weight gain (the outcome) Y. To do so, we used data from 1566 cigarette
smokers aged 25-74 years who were classified as treated A = 1 if they quit
smoking, and as untreated A = 0 otherwise. We assumed that exchangeability
of the treated and the untreated was achieved conditional on the L variables:
sex, age, race, education, intensity and duration of smoking, physical activity
in daily life, recreational exercise, and weight. We defined the average causal
effect on the difference scale as E[Y^{a=1,c=0}] − E[Y^{a=0,c=0}], that is, the difference
in mean weight that would have been observed if everybody had been treated
and uncensored compared with untreated and uncensored.

As in previous chapters, we restricted the analysis to NHEFS individuals
with known sex, age, race, weight, height, education, alcohol
use and intensity of smoking at the baseline (1971-75) and follow-up
(1982) visits, and who answered the medical history questionnaire at
baseline.

The quantity E[Y^{a=1,c=0}] − E[Y^{a=0,c=0}] measures the average causal effect
in the entire population. But sometimes one can be interested in the
average causal effect in a subset of the population. For example, one may
want to estimate the average causal effect in women,
E[Y^{a=1,c=0}|woman] − E[Y^{a=0,c=0}|woman], in individuals aged 45, in those with low educational
level, etc. To estimate the effect in a subset of the population one can use
marginal structural models with product terms (see Chapter 12) or apply
standardization to that subset only (Chapter 13).

Suppose that the investigator is interested in estimating the causal effect 
of smoking cessation A on weight gain Y in each of the strata defined by 
combinations of values of the variables L. In our example, there are many such 
strata. One of them is the stratum {non-quitter, male, white, age 26, college 
dropout, 15 cigarettes/day, 12 years of smoking habit, moderate exercise, very 
active, weight 112 kg}. As described in Chapter 4, investigators could partition 
the study population into mutually exclusive subsets or non-overlapping strata,
each of them defined by a particular combination of values l of the variables
in L, and then estimate the average causal effect in each of the strata. In
Section 12.5 we explain that an alternative approach is to add all variables
L, together with product terms between each component of L and treatment
A, to the marginal structural model. Then the stabilized weights SW^A(L)
equal 1 and no IP weighting is necessary because the (unweighted) outcome 
regression model, if correctly specified, fully adjusts for all confounding by L 
(see Chapter 15). 

In this chapter we will use g-estimation to estimate the average causal effect
of smoking cessation A on weight gain Y in each stratum defined by the covariates
L. This conditional effect is represented by E[Y^{a=1,c=0}|L] − E[Y^{a=0,c=0}|L].
Before describing g-estimation, we will present structural nested models and
rank preservation, and, in the next section, articulate the condition of
exchangeability given L in a new way.


14.2 Exchangeability revisited 


You may find the first paragraph 
of this section repetitious and un- 
necessary given our previous discus- 
sions of conditional exchangeability. 
If that is the case, we could not be 
happier. 


For simplicity, in this book we do 
not distinguish between vector and 
scalar parameters. This is an abuse 
of notation, but we believe it does 
not create any confusion. 


As a reminder (see Chapter 2), in our example, conditional exchangeability im- 
plies that, in any subset of the study population in which all individuals have 
the same values of L, those who did not quit smoking (A = 0) would have had 
the same mean weight gain as those who did quit smoking (A = 1) if they had 
not quit, and vice versa. In other words, conditional exchangeability means 
that the outcome distribution in the treated and the untreated would be the 
same if both groups had received the same treatment level. When the distribution
of the outcomes Y^a under treatment level a is the same for the treated
and the untreated, each of the counterfactual outcomes Y^a is independent of
the actual treatment level A, within levels of the covariates, or Y^a ⫫ A|L for
both a = 1 and a = 0.

Take the counterfactual outcome under no treatment Y^{a=0}. When conditional
exchangeability holds, knowing the value of Y^{a=0} does not help differentiate
between quitters and nonquitters with a particular value of L. That is,
the conditional (on L) probability of being a quitter is the same for all values
of the counterfactual outcome Y^{a=0}. Mathematically, we write


Pr[A = 1|Y^{a=0}, L] = Pr[A = 1|L]


which is an equivalent definition of conditional exchangeability for a dichoto- 
mous treatment A. 

Expressing conditional exchangeability in terms of the conditional proba- 
bility of treatment will be helpful when we describe g-estimation later in this 
chapter. Specifically, suppose we propose the following parametric logistic 
model for the probability of treatment 


logit Pr[A = 1|Y^{a=0}, L] = α_0 + α_1 Y^{a=0} + α_2 L


where α_2 is a vector of parameters, one for each component of L. If L has p
components L_1, ..., L_p then α_2 L = Σ_{j=1}^{p} α_{2j} L_j. This model is the same one we
used to estimate the denominator of the IP weights in Chapter 12, except that
this model also includes the counterfactual outcome Y^{a=0} as a covariate.

Of course, we can never fit this model to a real data set because we do
not know the value of the variable $Y^{a=0}$ for all individuals. But suppose for
a second that we had data on $Y^{a=0}$ for all individuals, and that we fit the
above logistic model. If there is conditional exchangeability and the model
is correctly specified, what estimate would you expect for the parameter $\alpha_1$?
Pause and think about it before going on (the response can be found near the
end of this paragraph) because we will be estimating the parameter $\alpha_1$ when
implementing g-estimation. If you have already guessed what its value should
be, you have already understood half of g-estimation. Yes, the expected value
of the estimate of $\alpha_1$ is zero because $Y^{a=0}$ does not predict A conditional on
L. We now introduce the other half of g-estimation: the structural model.
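The thought experiment above can be tried on toy data. The sketch below is only an illustration under stated assumptions: the data set is hypothetical (it is not the smoking cessation data), it contains a single binary covariate L, and it is built so that the treated and the untreated have the same mean of $Y^{a=0}$ within each level of L. The helper names `solve` and `fit_logistic` are ours, and the fit is a plain Newton-Raphson routine written with the standard library.

```python
import math

def solve(M, v):
    """Solve the linear system M x = v by Gaussian elimination with pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_logistic(X, a, iters=25):
    """Maximum likelihood logistic regression of a on X via Newton-Raphson."""
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, ai in zip(X, a):
            p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(coef, x))))
            for j in range(k):
                grad[j] += (ai - p) * x[j]
                for m in range(k):
                    hess[j][m] += p * (1.0 - p) * x[j] * x[m]
        coef = [b + s for b, s in zip(coef, solve(hess, grad))]
    return coef

# Hypothetical toy data (L, A, Y0), where Y0 plays the role of Y^{a=0}.
# Within each level of L, treated and untreated have the same mean of Y0.
rows = (
    [(0, 0, y) for y in (-4, -2, 0, 2, 4, 6, 8, 10)]
    + [(0, 1, y) for y in (-1, 3, 7)]
    + [(1, 0, y) for y in (0, 2, 4, 6, 8)]
    + [(1, 1, y) for y in (1, 3, 4, 5, 7)]
)
A = [a for _, a, _ in rows]

# Columns: intercept, outcome variable, L. The coefficient at index 1 is alpha_1.
X_counterfactual = [[1.0, float(y0), float(l)] for l, _, y0 in rows]
alpha_cf = fit_logistic(X_counterfactual, A)

# For contrast: the observed outcome Y = Y0 + 3.5 * A (assumed effect of 3.5).
X_observed = [[1.0, y0 + 3.5 * a, float(l)] for l, a, y0 in rows]
alpha_obs = fit_logistic(X_observed, A)

print("alpha_1 using Y^{a=0}:", alpha_cf[1])
print("alpha_1 using observed Y:", alpha_obs[1])
```

In this constructed example the coefficient for the counterfactual outcome is zero up to rounding, whereas the coefficient for the observed outcome is positive because treatment shifts the outcome of the treated.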


14.3 Structural nested mean models 


Robins (1991) first described the 
class of structural nested models. 
These models are “nested” when 
the treatment is time-varying. See 
Part III for an explanation. 


We are interested in estimating the average causal effect of treatment A within
levels of L, that is, $E[Y^{a=1} \mid L] - E[Y^{a=0} \mid L]$. (For simplicity, suppose
there is no censoring until later in this section.) Note that we can also represent
this effect by $E[Y^{a=1} - Y^{a=0} \mid L]$ because the difference of the means is equal
to the mean of the differences. If there were no effect-measure modification by L,
these differences would be constant across strata, i.e.,
$E[Y^{a=1} - Y^{a=0} \mid L] = \beta_1$ where $\beta_1$ would be the average causal effect in
each stratum and also in the entire population. Our structural model for the
conditional causal effect would be $E[Y^a - Y^{a=0} \mid L] = \beta_1 a$.

More generally, there may be effect modification by L. For example, the
causal effect of smoking cessation may be greater among heavy smokers than
among light smokers. To allow for the causal effect to depend on L we can add a
product term to the structural model, i.e.,
$E[Y^a - Y^{a=0} \mid L] = \beta_1 a + \beta_2 aL$, where $\beta_2$ is a vector of parameters.
Under conditional exchangeability $Y^a \perp\!\!\!\perp A \mid L$, the
conditional effect will be the same in the treated and in the untreated because
the treated and the untreated are, on average, the same type of people within
levels of L. Thus, under exchangeability, the structural model can also be
written as

$$E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a + \beta_2 aL$$

which is referred to as a structural nested mean model. The parameters $\beta_1$ and
$\beta_2$ (again, a vector), which are estimated by g-estimation, quantify the average
causal effect of smoking cessation A on Y within levels of A and L.

In Chapter 13 we considered parametric models for the mean outcome Y 
that, like structural nested models, were also conditional on treatment A and 
covariates L. Those outcome models were the basis for standardization when 
estimating the parametric g-formula. In contrast with those parametric models,
structural nested models are semiparametric because they are agnostic
about both the intercept and the main effect of L. That is, there is no parameter
$\beta_0$ and no parameter $\beta_3$ for a term $\beta_3 L$. As a result of leaving these
parameters unspecified, structural nested models make fewer assumptions and 
can be more robust to model misspecification than the parametric g-formula. 
See Fine Point 14.1 for a description of the relation between structural nested 
models and the marginal structural models of Chapter 12. 

In the presence of censoring, our causal effect of interest is not
$E[Y^{a=1} - Y^{a=0} \mid A, L]$ but $E[Y^{a=1,c=0} - Y^{a=0,c=0} \mid A, L]$, that is, the
average causal effect if everybody had remained uncensored. Estimating this
difference requires adjustment for both confounding and selection bias (due to
censoring C = 1) for the effect of treatment A. As described in the previous two
chapters, IP weighting and standardization can be used to adjust for these two
biases. G-estimation, on the other hand, can only be used to adjust for
confounding, not selection bias.


Fine Point 14.1


Relation between marginal structural models and structural nested models. Consider a marginal structural mean
model for the average outcome under treatment level a within levels of the dichotomous covariate V, a component of
L,

$$E[Y^a \mid V] = \beta_0 + \beta_1 a + \beta_2 aV + \beta_3 V$$

The sum $\beta_1 + \beta_2 v$ is the average causal effect $E[Y^{a=1} - Y^{a=0} \mid V = v]$ in the stratum V = v, and the sum
$\beta_0 + \beta_3 v$ is the mean counterfactual outcome under no treatment $E[Y^{a=0} \mid V = v]$ in the stratum V = v. Suppose
the only inferential goal is the average causal effect $\beta_1 + \beta_2 v$, i.e., we are not interested in estimating
$\beta_0 + \beta_3 v = E[Y^{a=0} \mid V = v]$. Then we would write the model as
$E[Y^a \mid V] = E[Y^{a=0} \mid V] + \beta_1 a + \beta_2 aV$ or, equivalently, as

$$E[Y^a - Y^{a=0} \mid V] = \beta_1 a + \beta_2 aV$$

which is referred to as a semiparametric marginal structural mean model because, unlike the marginal structural models
in Chapter 12, it leaves the mean counterfactual outcomes under no treatment $E[Y^{a=0} \mid V]$ completely unspecified.

To estimate the parameters of this semiparametric marginal structural model in the absence of censoring, we first
create a pseudo-population with IP weights $SW^A(V) = f(A \mid V)/f(A \mid L)$. In this pseudo-population there is only
confounding by V and therefore the semiparametric marginal structural model is a structural nested model whose
parameters are estimated by g-estimation with V used in place of L and each individual's contribution weighted by
$SW^A(V)$. Therefore, in settings without time-varying treatments, structural nested models are identical to
semiparametric marginal structural models that leave the mean counterfactual outcomes under no treatment unspecified.
Because marginal structural mean models include more parameters than structural nested mean models, the latter may be
more robust to model misspecification.

Consider the special case of a semiparametric marginal structural mean model within levels of all variables in L,
rather than only a subset V, so that $SW^A(V)$ are equal to 1 for all individuals. That is, let us consider the model
$E[Y^a - Y^{a=0} \mid L] = \beta_1 a + \beta_2 aL$, which we refer to as a faux semiparametric marginal structural model. Under
conditional exchangeability, this model is the structural nested mean model we use in this chapter.


Thus, when using g-estimation, one first needs to adjust for selection bias
due to censoring by IP weighting. In practice, this means that we first estimate
nonstabilized IP weights for censoring to create a pseudo-population in which
nobody is censored, and then apply g-estimation to the pseudo-population.
In our smoking cessation example, we will use the nonstabilized IP weights
$W^C = 1/\Pr[C = 0 \mid L, A]$ that we estimated in Chapter 12. Again we assume
that the vector of variables L is sufficient to adjust for both confounding and
selection bias.

Technically, IP weighting is not necessary
to adjust for selection bias
when using g-estimation with a
time-fixed treatment that does not
affect any variable in L, and an
outcome measured at a single time
point. That is, if as we have been
assuming $Y^a \perp\!\!\!\perp (A, C) \mid L$, we can
apply g-estimation to the uncensored
subjects without having to IP
weight.

All the g-estimation analyses described in this chapter incorporate IP weights
to adjust for the potential selection bias due to censoring. Under the assumption
that the censored and the uncensored are exchangeable conditional on the
measured covariates L, the structural nested mean model
$E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a + \beta_2 aL$, when applied to the pseudo-population
created by the IP weights $W^C$, is really a structural model in the absence of
censoring:

$$E[Y^{a,c=0} - Y^{a=0,c=0} \mid A = a, L] = \beta_1 a + \beta_2 aL$$

For simplicity, we will omit the superscript c = 0 hereafter in this chapter.
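As a rough illustration of the first step (creating the pseudo-population in which nobody is censored), the sketch below computes nonstabilized weights $W^C = 1/\Pr[C = 0 \mid L, A]$. It rests on assumptions that are ours, not the book's: the records are hypothetical, L is a single binary covariate, and $\Pr[C = 0 \mid L, A]$ is estimated by the observed proportion of uncensored individuals in each (L, A) stratum rather than by the parametric model of Chapter 12.

```python
from collections import Counter

# Hypothetical records (L, A, C); C = 1 means the outcome was censored.
records = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (0, 1, 0), (0, 1, 1),
    (1, 0, 0), (1, 0, 0),
    (1, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 1),
]

n_total = Counter((l, a) for l, a, _ in records)
n_uncensored = Counter((l, a) for l, a, c in records if c == 0)

# Estimated Pr[C = 0 | L, A] in each (L, A) stratum.
p_uncensored = {s: n_uncensored[s] / n_total[s] for s in n_total}

# Nonstabilized weight W^C for uncensored individuals. Censored individuals
# contribute nothing to the pseudo-population, so they get weight 0 here.
weights = [0.0 if c == 1 else 1.0 / p_uncensored[(l, a)] for l, a, c in records]

pseudo_population_size = sum(weights)
print(p_uncensored)
print(pseudo_population_size)
```

With stratum-specific proportions, the weighted uncensored individuals add up to the size of the original population, which is what "a pseudo-population in which nobody is censored" means in this toy setting.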







Technical Point 14.1 


Multiplicative structural nested mean models. In the text we only consider additive structural nested mean models.
When the outcome variable Y can only take positive values, a multiplicative structural nested mean model is preferred.
An example of a multiplicative structural nested mean model is

$$\log\left(\frac{E[Y^a \mid A = a, L]}{E[Y^{a=0} \mid A = a, L]}\right) = \beta_1 a + \beta_2 aL$$

which can be fit by g-estimation with $H(\psi^\dagger)$ defined to be $Y \exp[-\psi_1^\dagger A - \psi_2^\dagger AL]$.

The above multiplicative model can also be used for binary (0, 1) outcome variables as long as the probability of
Y = 1 is small in all strata of L. Otherwise, the model might predict probabilities greater than 1. If the probability is
not small, one can consider a structural nested logistic model for a dichotomous outcome Y such as

$$\text{logit} \Pr[Y^a = 1 \mid A = a, L] - \text{logit} \Pr[Y^{a=0} = 1 \mid A = a, L] = \beta_1 a + \beta_2 aL$$

Unfortunately, structural nested logistic models do not generalize easily to time-varying treatments and their parameters
cannot be estimated using the g-estimation algorithm described in the text. For details, see Robins (1999) and Tchetgen
Tchetgen and Rotnitzky (2011).





In this chapter we will use g-estimation of a structural nested mean model
to estimate the effect of the dichotomous treatment “smoking cessation”, but
structural nested models can also be used for continuous treatment variables,
like “change in smoking intensity” (see Chapter 12). For continuous variables,
the model needs to specify the dose-response function for the effect of treatment
A on the mean outcome Y. For example,
$E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a + \beta_2 a^2 + \beta_3 aL + \beta_4 a^2 L$, or
$E[Y^a - Y^{a=0} \mid A = a, L]$ could be a smooth function, e.g., splines, of A and L.

Unlike IP weighting, g-estimation
cannot be easily extended to estimate
the parameters of structural
logistic models for dichotomous
outcomes. See Technical
Point 14.1.

We now turn our attention to the concept of rank preservation, which will
help us describe g-estimation of structural nested models.


14.4 Rank preservation 


In our smoking cessation example, all individuals can be ranked according to
the value of their observed outcome Y. Subject 23522 is ranked first with
weight gain of 48.5 kg, individual 6928 is ranked second with weight gain 47.5
kg... and individual 23321 is ranked last with weight gain of −41.3 kg. Similarly
we could think of ranking all individuals according to the value of their
counterfactual outcome under treatment $Y^{a=1}$ if the value of $Y^{a=1}$ were known
for all individuals rather than only for those who were actually treated. Suppose
for a second that we could actually rank everybody according to $Y^{a=1}$ and
also according to $Y^{a=0}$. We would then have two lists of individuals ordered
from larger to smaller value of the corresponding counterfactual outcome. If
both lists are in identical order we say that there is rank preservation.

CODE: Program 14.1

When the effect of treatment A on the outcome Y is exactly the same,
on the additive scale, for all individuals in the study population, we say that
additive rank preservation holds. For example, if smoking cessation increases
everybody’s body weight by exactly 3 kg, then the ranking of individuals
according to $Y^{a=0}$ would be equal to the ranking according to $Y^{a=1}$, except
that in the latter list all individuals will be 3 kg heavier. A particular case of
additive rank preservation occurs when the sharp null hypothesis is true (see
Chapter 1), i.e., if treatment has no effect on the outcomes of any individual in
the study population. For the purposes of structural nested mean models we
will care about additive rank preservation within levels of L. This conditional
additive rank preservation holds if the effect of treatment A on the outcome Y
is exactly the same for all individuals with the same values of L.

An example of an (additive conditional) rank-preserving structural model
is

$$Y_i^a - Y_i^{a=0} = \psi_1 a + \psi_2 a L_i \quad \text{for all individuals } i$$

where $\psi_1 + \psi_2 l$ is the constant causal effect for all individuals with covariate
values L = l. That is, for every individual i with L = l, the value of $Y_i^{a=1}$ is
equal to $Y_i^{a=0} + \psi_1 + \psi_2 l$. An individual’s counterfactual outcome under no
treatment $Y_i^{a=0}$ is shifted by $\psi_1 + \psi_2 l$ to obtain the value of her counterfactual
outcome under treatment.
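A small numerical sketch may make the definition concrete. All numbers below are hypothetical: four individuals in one stratum L = l, made-up values of $\psi_1$ and $\psi_2$, and made-up counterfactual outcomes. The first comparison shifts everybody by the same amount $\psi_1 + \psi_2 l$; the second uses individual-specific effects as a contrast.

```python
def ranking(values):
    """Indices of individuals ordered from largest to smallest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

psi1, psi2 = 3.0, 0.5          # hypothetical parameter values
l = 2                          # the common covariate value in this stratum
y_a0 = [1.5, -0.7, 4.2, 0.3]   # hypothetical Y^{a=0} for four individuals

# Rank-preserving model: every individual is shifted by psi1 + psi2 * l.
y_a1_constant = [y + psi1 + psi2 * l for y in y_a0]

# Contrast: individual-specific effects (not a rank-preserving model).
effects = [6.0, 1.0, -2.0, 3.0]
y_a1_varying = [y + e for y, e in zip(y_a0, effects)]

print(ranking(y_a0) == ranking(y_a1_constant))   # same order in both lists
print(ranking(y_a0) == ranking(y_a1_varying))    # order changes
```

Note that the average of the individual-specific effects is still well defined in the second case, even though the ranks are not preserved.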

Figure 14.1 shows an example of additive rank preservation within the
stratum L = l. The bell-shaped curves represent the distribution of the
counterfactual outcomes $Y^{a=0}$ (left curve) and $Y^{a=1}$ (right curve). The two dots in
the upper part of the figure represent the values of the two counterfactual
outcomes for individual i, and the two dots in the lower part represent the values
of the two counterfactual outcomes for individual j. The arrows represent the
shifts from $Y^{a=0}$ to $Y^{a=1}$, which are equal to $\psi_1 + \psi_2 l$ for all individuals in this
stratum. Figure 14.2 shows an example of rank preservation within another
stratum L = l′. The distribution of the counterfactual outcomes is different
from that in stratum L = l. For example, the mean of $Y^{a=0}$ in Figure 14.1 is
to the left of the mean of $Y^{a=0}$ in Figure 14.2, which means that, on average,
individuals in stratum L = l have a smaller weight gain under no smoking
cessation than individuals in stratum L = l′. The shift from $Y^{a=0}$ to $Y^{a=1}$ is
$\psi_1 + \psi_2 l'$ for all individuals with L = l′, as shown for individuals p and q.

For most treatments and outcomes, the individual causal effect is not ex- 
pected to be constant—not even approximately constant—across individuals 
with the same covariate values, and thus (additive conditional) rank preserva- 
tion is scientifically implausible. In our example we do not expect that smoking 
cessation affects equally the body weight of all individuals with the same val- 
ues of L. Some people are—genetically or otherwise—more susceptible to the 
effects of smoking cessation than others, even within levels of the covariates 
L. The individual causal effect of smoking cessation will vary across people: 
after quitting smoking some individuals will gain a lot of weight, some will 
gain little, and others may even lose some weight. Reality may look more like
the situation depicted in Figure 14.3, in which the shift from $Y^{a=0}$ to $Y^{a=1}$
varies across individuals with the same covariate values, and even ranks are
not preserved since the outcome for individual i is less than that for individual
j when a = 0 but not when a = 1.

Because of the implausibility of rank preservation, one should not generally 
use methods for causal inference that rely on it. In fact none of the methods 
we consider in this book require rank preservation. For example, the marginal 
structural mean models from Chapter 12 are models for average causal effects, 
not for individual causal effects, and thus they do not assume rank preservation. 
The estimated average causal effect of smoking cessation on weight gain was 
3.5 kg (95% confidence interval: 2.5, 4.5). This average effect is agnostic as 
to whether rank preservation of individual causal effects holds. Similarly, the 
structural nested mean model in the previous section made no assumptions 
about rank preservation. 


The additive rank-preserving model in this section makes a much stronger
assumption than non-rank-preserving models: the assumption of constant
treatment effect for all individuals with the same value of L. There is no reason
why we would want to use such an unrealistic rank-preserving model in practice.
And yet we use it in the next section to introduce g-estimation because
g-estimation is easier to understand for rank-preserving models, and because
the g-estimation procedure is actually the same for rank-preserving and
non-rank-preserving models. Note that the (conditional additive) rank-preserving
structural model is a structural mean model: the mean of the individual shifts
from $Y^{a=0}$ to $Y^{a=1}$ is equal to each of the individual shifts within levels of L.

A structural nested mean model
is well defined in the absence of
rank preservation. For example,
one could propose a model for the
setting depicted in Figure 14.3 to
estimate the average causal effect
within strata of L. Such average
causal effect will generally differ
from the individual causal effects.


14.5 G-estimation


This section links the material in the previous three sections. Suppose the
goal is estimating the parameters of the structural nested mean model
$E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a$. For simplicity, we first consider a model with a
single parameter $\beta_1$. Because the model lacks product terms $\beta_2 aL$, we are
effectively assuming that the average causal effect of smoking cessation is
constant across strata of L, i.e., no additive effect modification by L.

We also assume that the additive rank-preserving model $Y_i^a - Y_i^{a=0} = \psi_1 a$
is correctly specified for all individuals i. Then the individual causal effect $\psi_1$
is equal to the average causal effect $\beta_1$ in which we are interested. We write
the rank-preserving model as $Y^a - Y^{a=0} = \psi_1 a$, without a subscript i to index
individuals because the model is the same for all individuals. For reasons that
will soon be obvious, we write the model in the equivalent form

$$Y^{a=0} = Y^a - \psi_1 a$$


The first step in g-estimation is linking the model to the observed data. To
do so, remember that an individual’s observed outcome Y is, by consistency,
the counterfactual outcome $Y^{a=1}$ if the person received treatment A = 1 or
the counterfactual outcome $Y^{a=0}$ if the person received no treatment A = 0.
Therefore, if we replace the fixed value a in the structural model by each
individual’s value A (which will be 1 for some and 0 for others) then we can
replace the counterfactual outcome $Y^a$ by the individual’s observed outcome
$Y^A = Y$. The rank-preserving structural model then implies an equation
in which each individual’s counterfactual outcome $Y^{a=0}$ is a function of his
observed data on treatment and outcome and the unknown parameter $\psi_1$:

$$Y^{a=0} = Y - \psi_1 A$$


If this model were correct and we knew the value of $\psi_1$ then we could calculate
the counterfactual outcome under no treatment $Y^{a=0}$ for each individual
in the study population. But we don’t know $\psi_1$. Estimating it is precisely the
goal of our analysis.

Let us play a game. Suppose a friend of yours knows the value of $\psi_1$ but he
only tells you that $\psi_1$ is one of the following: $\psi^\dagger = -20$, $\psi^\dagger = 0$, or $\psi^\dagger = 10$.
He challenges you: “Can you identify the true value $\psi_1$ among the 3 possible
values $\psi^\dagger$?” You accept the challenge. For each individual, you compute

$$H(\psi^\dagger) = Y - \psi^\dagger A$$

for each of the three possible values $\psi^\dagger$. The newly created variables H(−20),
H(0), and H(10) are candidate counterfactuals. Only one of them is the
counterfactual outcome $Y^{a=0}$. More specifically, $H(\psi^\dagger) = Y^{a=0}$ if $\psi^\dagger = \psi_1$. In
this game, choosing the correct value of $\psi_1$ is equivalent to choosing which
one of the three candidate counterfactuals $H(\psi^\dagger)$ is the true counterfactual
$Y^{a=0} = H(\psi_1)$. Can you think of a way to choose the right $H(\psi^\dagger)$?

Remember from Section 14.2 that the assumption of conditional exchangeability
can be expressed as a logistic model for treatment given the counterfactual
outcome and the covariates L. When conditional exchangeability holds,
the parameter $\alpha_1$ for the counterfactual outcome should be zero. So we have
a simple method to choose the true counterfactual out of the three variables
$H(\psi^\dagger)$. We fit three separate logistic models

$$\text{logit} \Pr[A = 1 \mid H(\psi^\dagger), L] = \alpha_0 + \alpha_1 H(\psi^\dagger) + \alpha_2 L,$$

one per each of the three candidates $H(\psi^\dagger)$. The candidate $H(\psi^\dagger)$ with $\alpha_1 = 0$
is the counterfactual $Y^{a=0}$, and the corresponding $\psi^\dagger$ is the true value $\psi_1$. For
example, suppose that $H(\psi^\dagger = 10)$ is unassociated with treatment A given
the covariates L. Then our estimate $\hat{\psi}_1$ of $\psi_1$ is 10. We are done. That was
g-estimation.

Rosenbaum (1987) proposed a version
of this procedure for non-time-varying
treatments.

Important: G-estimation does not
test whether conditional exchangeability
holds; it assumes that conditional
exchangeability holds.

CODE: Program 14.2

We calculated the P-value from
a Wald test. Any other valid
test may be used. For example,
we could have used a Score
test, which simplifies the calculations
(it doesn’t require fitting multiple
models) and, in large samples,
is essentially equivalent to a Wald
test.
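The three-candidate game can be played on toy data. The sketch below is a simplification under stated assumptions, not the procedure of the text: the data are hypothetical (one binary covariate L, true $\psi_1 = 10$, no censoring), and instead of fitting three logistic models it evaluates the quantity $\sum_i H_i(\psi^\dagger)(A_i - \Pr[A = 1 \mid L_i])$, which is zero when $H(\psi^\dagger)$ is unassociated with A given L. The function name `association` is ours.

```python
from collections import Counter

TRUE_PSI = 10.0  # known only to the "friend" in the game

# Hypothetical toy data (L, A, Y0); within each level of L the treated and
# the untreated have the same mean of Y0 = Y^{a=0}.
rows = (
    [(0, 0, y) for y in (-4, -2, 0, 2, 4, 6, 8, 10)]
    + [(0, 1, y) for y in (-1, 3, 7)]
    + [(1, 0, y) for y in (0, 2, 4, 6, 8)]
    + [(1, 1, y) for y in (1, 3, 4, 5, 7)]
)
# Observed data (L, A, Y) under the rank-preserving model Y = Y0 + psi_1 * A.
data = [(l, a, y0 + TRUE_PSI * a) for l, a, y0 in rows]

# Pr[A = 1 | L] estimated by the proportion treated at each level of L.
n_by_l = Counter(l for l, _, _ in data)
treated_by_l = Counter(l for l, a, _ in data if a == 1)
p_treat = {l: treated_by_l[l] / n_by_l[l] for l in n_by_l}

def association(psi):
    """Sum of H(psi) * (A - Pr[A=1|L]) with H(psi) = Y - psi * A."""
    return sum((y - psi * a) * (a - p_treat[l]) for l, a, y in data)

candidates = [-20.0, 0.0, 10.0]
stats = {psi: association(psi) for psi in candidates}
chosen = min(candidates, key=lambda psi: abs(stats[psi]))

print(stats)
print("chosen candidate:", chosen)
```

Only the candidate equal to the true value produces an H that is balanced between the treated and the untreated within levels of L, so it is the one selected.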

In practice, however, we need to g-estimate the parameter $\psi_1$ in the absence
of a friend who knows the right answer and likes to play games. Therefore we
will need to search over all possible values $\psi^\dagger$ until we find the one that results
in an $H(\psi^\dagger)$ with $\alpha_1 = 0$. Because not all possible values can be tested (there
is an infinite number of values $\psi^\dagger$ in any given interval) we can conduct a fine
search over the possible range of $\psi^\dagger$ values (e.g., from −20 to 20 by increments
of 0.01). The finer the search, the closer we will get to the g-estimate $\hat{\psi}_1$,
but also the greater the computational demands.

In our smoking cessation example, we first computed each individual’s value
of the 31 candidates H(2.0), H(2.1), H(2.2), ... H(4.9), and H(5.0) for values
$\psi^\dagger$ between 2.0 and 5.0 by increments of 0.1. We then fit 31 separate logistic
models for the probability of smoking cessation. These models were exactly
like the one used to estimate the denominator of the IP weights in Chapter
12, except that we added to each model one of the 31 candidates $H(\psi^\dagger)$.
The parameter estimate $\hat{\alpha}_1$ for $H(\psi^\dagger)$ was closest to zero for values H(3.4)
and H(3.5). A finer search found that the minimum absolute value of $\hat{\alpha}_1$ (which
was essentially zero) was for H(3.446). Thus, our g-estimate $\hat{\psi}_1$ of the average
causal effect $\psi_1 = \beta_1$ of smoking cessation on weight gain is 3.4 kg.
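The grid search can be sketched as follows. This is a toy illustration, not Program 14.2: the data are hypothetical (one binary covariate L, assumed true effect 3.5, no censoring and therefore no IP weights), and the logistic models are fit with a small Newton-Raphson routine written for this example (`solve` and `fit_logistic` are our own helper names).

```python
import math

def solve(M, v):
    """Solve the linear system M x = v by Gaussian elimination with pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_logistic(X, a, iters=25):
    """Maximum likelihood logistic regression of a on X via Newton-Raphson."""
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, ai in zip(X, a):
            p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(coef, x))))
            for j in range(k):
                grad[j] += (ai - p) * x[j]
                for m in range(k):
                    hess[j][m] += p * (1.0 - p) * x[j] * x[m]
        coef = [b + s for b, s in zip(coef, solve(hess, grad))]
    return coef

# Hypothetical toy data (L, A, Y0) with balanced means of Y0 within levels of L.
rows = (
    [(0, 0, y) for y in (-4, -2, 0, 2, 4, 6, 8, 10)]
    + [(0, 1, y) for y in (-1, 3, 7)]
    + [(1, 0, y) for y in (0, 2, 4, 6, 8)]
    + [(1, 1, y) for y in (1, 3, 4, 5, 7)]
)
data = [(l, a, y0 + 3.5 * a) for l, a, y0 in rows]   # observed (L, A, Y)
A = [a for _, a, _ in data]

def alpha1(psi):
    """Coefficient of H(psi) = Y - psi * A in the logistic model for A."""
    X = [[1.0, y - psi * a, float(l)] for l, a, y in data]
    return fit_logistic(X, A)[1]

grid = [round(2.0 + 0.1 * k, 1) for k in range(31)]   # 2.0, 2.1, ..., 5.0
alpha1_by_psi = {psi: alpha1(psi) for psi in grid}
g_estimate = min(grid, key=lambda psi: abs(alpha1_by_psi[psi]))
print("g-estimate:", g_estimate)
```

In this constructed data set the coefficient of $H(\psi^\dagger)$ changes sign as $\psi^\dagger$ crosses the value used to generate the data, and the candidate whose coefficient is closest to zero is taken as the g-estimate.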

To compute a 95% confidence interval around our g-estimate of 3.4, we used
the P-value for a test of $\alpha_1 = 0$ in the logistic models fit above. As expected,
the P-value was 1 (it was actually 0.998) for $\psi^\dagger = 3.446$, which is the value
$\psi^\dagger$ that results in a candidate $H(\psi^\dagger)$ with a parameter estimate $\hat{\alpha}_1 = 0$. Of
the 31 logistic models that we fit for $\psi^\dagger$ values between 2.0 and 5.0, the P-value
was greater than 0.05 in all models with $H(\psi^\dagger)$ based on $\psi^\dagger$ values between
approximately 2.5 and 4.5. That is, the test did not reject the null hypothesis
at the 5% level for the subset of $\psi^\dagger$ values between 2.5 and 4.5. By inverting
the test results, we concluded that the limits of the 95% confidence interval
around 3.4 are 2.5 and 4.5. Another option to compute the 95% confidence
interval is bootstrapping of the g-estimation procedure.
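Test inversion can also be sketched on toy data. The example below makes several assumptions of its own: the data are hypothetical (a small pattern of records repeated 20 times, one binary covariate L, assumed true effect 3.5), there is no censoring, and the P-value comes from a model-based Wald test rather than the robust variance estimator used with IP weights in the text. The helper names are ours.

```python
import math

def solve(M, v):
    """Solve the linear system M x = v by Gaussian elimination with pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_logistic(X, a, iters=12):
    """Newton-Raphson fit; returns coefficients and the last information matrix."""
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, ai in zip(X, a):
            p = 1.0 / (1.0 + math.exp(-sum(b * xj for b, xj in zip(coef, x))))
            for j in range(k):
                grad[j] += (ai - p) * x[j]
                for m in range(k):
                    hess[j][m] += p * (1.0 - p) * x[j] * x[m]
        coef = [b + s for b, s in zip(coef, solve(hess, grad))]
    return coef, hess

# Hypothetical toy data (L, A, Y0), repeated 20 times to mimic a larger sample.
pattern = (
    [(0, 0, y) for y in (-4, -2, 0, 2, 4, 6, 8, 10)]
    + [(0, 1, y) for y in (-1, 3, 7)]
    + [(1, 0, y) for y in (0, 2, 4, 6, 8)]
    + [(1, 1, y) for y in (1, 3, 4, 5, 7)]
)
data = [(l, a, y0 + 3.5 * a) for l, a, y0 in pattern * 20]   # observed (L, A, Y)
A = [a for _, a, _ in data]

def wald_p_value(psi):
    """Two-sided Wald P-value for alpha_1 = 0 in the model with H(psi)."""
    X = [[1.0, y - psi * a, float(l)] for l, a, y in data]
    coef, info = fit_logistic(X, A)
    variance = solve(info, [0.0, 1.0, 0.0])[1]   # (1,1) element of the inverse
    z = coef[1] / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))

grid = [round(2.0 + 0.1 * k, 1) for k in range(31)]
not_rejected = [psi for psi in grid if wald_p_value(psi) > 0.05]
ci_lower, ci_upper = min(not_rejected), max(not_rejected)
print("95% confidence limits on the grid:", ci_lower, ci_upper)
```

The limits are only as fine as the grid, and they are the grid end points whenever the test fails to reject at the boundary of the search range, so in practice the range and step would need to be chosen with that in mind.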

More generally, the 95% confidence interval for a g-estimate is determined
by finding the set of values of $\psi^\dagger$ that result in a P-value > 0.05 when testing for
$\alpha_1 = 0$. The 95% confidence interval is obtained by inversion of the statistical
test for $\alpha_1 = 0$, with the limits of the 95% confidence interval being the limits
of the set of values $\psi^\dagger$ with P-value > 0.05. In our example, the statistical test
was based on a robust variance estimator because of the use of IP weighting to
adjust for censoring. Therefore our 95% confidence interval is conservative in
large samples, i.e., it will trap the true value at least 95% of the time. In large
samples, bootstrapping would result in a non-conservative, and thus possibly
narrower, 95% confidence interval for the g-estimate.

In the presence of censoring, the fit
of the logistic models is necessarily
restricted to uncensored individuals
(C = 0), and the contribution
of each individual is weighted by
the estimate of his/her IP weight
$W^C$. See Technical Point 14.2.

Back to non-rank-preserving models. The g-estimation algorithm (i.e., the
computer code implementing the procedure) for $\psi_1$ produces a consistent
estimate of the parameter $\beta_1$ of the mean model, assuming the mean model is
correctly specified (that is, if the average treatment effect is equal in all levels
of L). This is true regardless of whether the individual treatment effect is
constant, that is, regardless of whether the conditional additive rank
preservation holds. In other words, the validity of the g-estimation algorithm does
not actually require that $H(\beta_1) = Y^{a=0}$ for all individuals, where $\beta_1$ is the
parameter value in the mean model. Rather, the algorithm only requires that
$H(\beta_1)$ and $Y^{a=0}$ have the same conditional mean given L.

Interestingly, the above g-estimation procedure can be readily modified to
incorporate a sensitivity analysis for unmeasured confounding, as described in
Fine Point 14.2.


Fine Point 14.2


Sensitivity analysis for unmeasured confounding. G-estimation relies on the fact that $\alpha_1 = 0$ if conditional
exchangeability given L holds. Now consider a setting in which conditional exchangeability does not hold. For example,
suppose that the probability of quitting smoking A is lower for individuals whose spouse is a smoker, and that the
spouse's smoking status is associated with important determinants of weight gain Y not included in L. That is,
there is unmeasured confounding by spouse's smoking status. Because now the variables in L are insufficient to achieve
exchangeability of the treated and the untreated, the treatment A and the counterfactual $Y^{a=0}$ are associated conditional
on L. That is, $\alpha_1 \neq 0$ and we cannot apply g-estimation as described in the main text.

But g-estimation does not require that $\alpha_1 = 0$. Suppose that, because of unmeasured confounding by the spouse’s
smoking status, $\alpha_1$ is expected to be 0.1 rather than 0. Then we can apply g-estimation as described in the text
except that we will test whether $\alpha_1 = 0.1$ rather than whether $\alpha_1 = 0$. G-estimation does not require that conditional
exchangeability given L holds, but that the magnitude of nonexchangeability (the value of $\alpha_1$) is known. This property
of g-estimation can be used to conduct sensitivity analyses for unmeasured confounding.

If we believe that L may not sufficiently adjust for confounding, then we can repeat our g-estimation analysis under
different scenarios of unmeasured confounding, represented by a range of values of $\alpha_1$, and plot the effect estimates
under each of them. Such a plot shows how sensitive our effect estimate is to unmeasured confounding of different
direction and magnitude. One practical problem for this approach is how to quantify the unmeasured confounding on
the $\alpha_1$ scale, e.g., is 0.1 a lot of unmeasured confounding? Robins, Rotnitzky, and Scharfstein (1999) provide technical
details on sensitivity analysis for unmeasured confounding using g-estimation.


14.6 Structural nested models with two or more parameters 


We have so far considered a structural nested mean model with a single
parameter $\beta_1$. The lack of product terms $\beta_2 aL$ implies that we believe that the
average causal effect of smoking cessation does not vary across strata of L. The
structural nested model will be misspecified, and thus our causal inferences
will be wrong, if there is indeed effect modification by some components V of
L but we failed to add a product term $\beta_2 aV$. This is in contrast with the
saturated marginal structural model $E[Y^a] = \beta_0 + \beta_1 a$, which is not misspecified
if we fail to add terms $\beta_2 aV$ and $\beta_3 V$ even if there is effect modification by V.
Marginal structural models that do not condition on V estimate the average
causal effect in the population, whereas those that condition on V estimate the
average causal effect within levels of V. Structural nested models estimate, by
definition, the average causal effect within levels of the confounders L, not the
average causal effect in the population. Omitting product terms in structural
nested models when there is effect modification will generally lead to bias due
to model misspecification.

As discussed in Chapter 12, a desirable
property of marginal structural
models is null preservation:
when the null hypothesis of no average
causal effect is true, the model
is never misspecified. Structural
nested models preserve the null too.
In contrast, although the parametric
g-formula preserves the null for
time-fixed treatments, it loses this
property in the time-varying setting
(see Part III).

The Nelder-Mead Simplex method
is an example of a directed search
method.

CODE: Program 14.3

You may argue that structural
nested models with multiple parameters
may not be necessary. If
all variables L are discrete and the
study population is large, one could
fit separate 1-parameter models to
each subset of the population defined
by joint levels of the covariates
contained in the vector L.

Fortunately, the g-estimation procedure described in the previous section
can be generalized to models with product terms. For example, suppose we
believe that the average causal effect of smoking cessation depends on the baseline
level of smoking intensity V. We may then consider the structural nested mean
model $E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a + \beta_2 aV$ and, for g-estimation purposes,
the corresponding rank-preserving model $Y_i^a - Y_i^{a=0} = \psi_1 a + \psi_2 aV_i$. Because
the structural model has two parameters, $\psi_1$ and $\psi_2$, we also need to include
two parameters in the IP weighted logistic model for $\Pr[A = 1 \mid H(\psi^\dagger), L]$ with
$\psi^\dagger = (\psi_1^\dagger, \psi_2^\dagger)$ and $H(\psi^\dagger) = Y - \psi_1^\dagger A - \psi_2^\dagger AV$. For example, we could fit
the logistic model

$$\text{logit} \Pr[A = 1 \mid H(\psi^\dagger), L] = \alpha_0 + \alpha_1 H(\psi^\dagger) + \alpha_2 H(\psi^\dagger)V + \alpha_3 L$$

and find the combination of values of $\psi_1^\dagger$ and $\psi_2^\dagger$ that result in a $H(\psi^\dagger)$ that is
independent of treatment A conditional on the covariates L. That is, we need
to search the combination of values $\psi_1^\dagger$ and $\psi_2^\dagger$ that make both $\alpha_1$ and $\alpha_2$ equal
to zero.

Because the model has two parameters, the search must be conducted over
a two-dimensional space. Thus a systematic, brute force search will be more
involved than that described in the previous section. Less computationally
intensive approaches, known as directed search methods, for approximate
searching are available in statistical software. For linear mean models like the one
discussed here (but not, for example, for certain survival analysis models)
the estimate can be directly calculated using a formula, i.e., the estimator has
closed form and a search over the possible values of the parameters is not
necessary (see Technical Point 14.2 for details). In our smoking cessation
example, the g-estimates were $\hat{\psi}_1 = 2.86$ and $\hat{\psi}_2 = 0.03$. The corresponding 95%
confidence intervals can be calculated by using the P-value of a joint test for
$\alpha_1 = \alpha_2 = 0$ or, more simply, by bootstrapping.
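For a linear model with two parameters, the closed-form route amounts to solving two linear estimating equations. The sketch below is one way to do this under assumptions that go beyond the text: the data are hypothetical, V is a binary covariate and is the only component of L, there is no censoring, $\Pr[A = 1 \mid L]$ is estimated by the proportion treated at each level of V, and the two equations are obtained by multiplying $H(\psi^\dagger)(A - \Pr[A = 1 \mid L])$ by the vector function (1, V). The values 2.0 and 1.5 are made-up parameters used to generate the data.

```python
from collections import Counter

PSI_TRUE = (2.0, 1.5)   # hypothetical psi_1 and psi_2 used to generate Y

# Hypothetical toy data (V, A, Y0) with balanced means of Y0 within levels of V.
rows = (
    [(0, 0, y) for y in (-4, -2, 0, 2, 4, 6, 8, 10)]
    + [(0, 1, y) for y in (-1, 3, 7)]
    + [(1, 0, y) for y in (0, 2, 4, 6, 8)]
    + [(1, 1, y) for y in (1, 3, 4, 5, 7)]
)
# Observed (V, A, Y) under the model Y = Y0 + psi_1 * A + psi_2 * A * V.
data = [(v, a, y0 + PSI_TRUE[0] * a + PSI_TRUE[1] * a * v) for v, a, y0 in rows]

# Pr[A = 1 | L] estimated by the proportion treated at each level of V.
n_by_v = Counter(v for v, _, _ in data)
treated_by_v = Counter(v for v, a, _ in data if a == 1)
p_treat = {v: treated_by_v[v] / n_by_v[v] for v in n_by_v}

# Estimating equations: sum of H(psi) * (A - p) * (1, V) = (0, 0), where
# H(psi) = Y - psi_1 * A - psi_2 * A * V. They are linear in (psi_1, psi_2):
#   m11 * psi_1 + m12 * psi_2 = b1
#   m12 * psi_1 + m22 * psi_2 = b2
m11 = m12 = m22 = b1 = b2 = 0.0
for v, a, y in data:
    r = a - p_treat[v]
    m11 += r * a
    m12 += r * a * v
    m22 += r * a * v * v
    b1 += r * y
    b2 += r * v * y

det = m11 * m22 - m12 * m12
psi1_hat = (b1 * m22 - m12 * b2) / det
psi2_hat = (m11 * b2 - m12 * b1) / det
print(psi1_hat, psi2_hat)
```

Because the toy data satisfy the balance condition exactly, the two generating values are recovered up to rounding; with real data the solution would be an estimate with sampling variability.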

In the more general case, we would consider a model that allows the average
causal effect of smoking cessation to vary across all strata of the variables
in L. For dichotomous variables, the corresponding rank-preserving model
$Y_i^a - Y_i^{a=0} = \psi_1 a + \sum_{j=1}^{p} \psi_{2j} a L_{ij}$ has p + 1 parameters
$\psi_1, \psi_{21}, \ldots, \psi_{2p}$, where $\psi_{2j}$ is the parameter corresponding to the product
term $aL_j$ and $L_j$ represents one of the p components of L. The average causal
effect in the entire study population can then be calculated as
$\psi_1 + \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{p} \psi_{2j} L_{ij}$, where n is the
number of individuals in the study. In practice, structural nested models with
multiple parameters have rarely been used.
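The population average in the last formula is a simple average of stratum-specific effects over the individuals in the study. A minimal sketch with made-up numbers (two dichotomous covariates, four individuals, hypothetical parameter values):

```python
psi1 = 2.0
psi2 = [1.0, -0.5]            # hypothetical psi_21 and psi_22

# Hypothetical covariate values L_i1, L_i2 for n = 4 individuals.
L = [
    [1, 0],
    [0, 1],
    [1, 1],
    [0, 0],
]

n = len(L)
# psi_1 + (1/n) * sum over individuals i and components j of psi_2j * L_ij
average_causal_effect = psi1 + sum(
    sum(p * l for p, l in zip(psi2, row)) for row in L
) / n
print(average_causal_effect)
```

Equivalently, the result is $\psi_1$ plus each $\psi_{2j}$ multiplied by the sample mean of $L_j$.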

In fact, structural nested models of any type have rarely been used, partly 
because of the lack of user-friendly software and partly because the extension 
of these models to survival analysis requires some additional considerations 
(see Chapter 17). We now review two methods that are arguably the most 
commonly used approaches to adjust for confounding: outcome regression and 
propensity scores.


Technical Point 14.2


G-estimation of structural nested mean models. Consider the structural nested model
$E[Y^a - Y^{a=0} \mid A = a, L] = \beta_1 a$. A consistent estimate of $\beta_1$ can be obtained by g-estimation under the
assumptions described in the text. Specifically, our estimate of $\beta_1$ is the value of $\psi^\dagger$ that minimizes the
association between $H(\psi^\dagger)$ and A. When we base our g-estimate on the score test (see, for example, Casella and
Berger 2002), this procedure is equivalent to finding the parameter value $\psi^\dagger$ that solves the estimating equation

$$\sum_{i=1}^{n} I[C_i = 0] \, W_i^C \, H_i(\psi^\dagger) \, (A_i - E[A \mid L_i]) = 0$$

where the indicator $I[C_i = 0]$ takes value 1 for individual i if $C_i = 0$ and takes value 0 otherwise, and the IP weight
$W_i^C$ and the expectation $E[A \mid L_i] = \Pr[A = 1 \mid L_i]$ are replaced by their estimates. $E[A \mid L_i]$ can be estimated from a
logistic model for treatment conditional on the covariates L in which the contribution of individual i is weighted by
$W_i^C$ if $C_i = 0$ and it is zero otherwise. [Because A and L are observed on all individuals, we could also estimate
$E[A \mid L_i]$ by an unweighted logistic regression of A on L using all individuals.]

The solution to the equation has a closed form and can therefore be calculated directly, i.e., no search over the
parameter space is required. Specifically, using the fact that $H_i(\psi^{\dagger}) = Y_i - \psi^{\dagger} A_i$ we obtain


$$\hat{\psi}_1 = \frac{\sum_{i=1}^{n} I[C_i = 0] \, W_i^{C} \, Y_i \, (A_i - E[A \mid L_i])}{\sum_{i=1}^{n} I[C_i = 0] \, W_i^{C} \, A_i \, (A_i - E[A \mid L_i])}$$


If $\psi$ is D-dimensional, we multiply the left-hand side of the estimating equation by a D-dimensional vector function of L.
The choice of the function affects the statistical efficiency of the estimator, but not its consistency. That is, although
all choices of the function will result in valid confidence intervals, the length of the confidence interval will depend on
the function. Robins (1994) provided a formal description of structural nested mean models, and derived the function
that minimizes confidence interval length.
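
The closed-form solution can be sketched in a few lines. This is a minimal illustration, assuming that the IP weights for censoring and the estimates of E[A|L] have already been obtained elsewhere; the function name, argument names, and toy data are hypothetical.

```python
def g_estimate_closed_form(y, a, c, w, e_a):
    """Closed-form g-estimate for the one-parameter linear structural nested mean model.

    y   -- outcomes (may be None for censored individuals)
    a   -- 0/1 treatment indicators
    c   -- 0/1 censoring indicators (0 = uncensored)
    w   -- estimated IP weights for censoring
    e_a -- estimated E[A | L_i] = Pr[A = 1 | L_i]
    """
    numerator = 0.0
    denominator = 0.0
    for yi, ai, ci, wi, ei in zip(y, a, c, w, e_a):
        if ci != 0:
            continue  # the indicator I[C_i = 0] is zero for censored individuals
        residual = ai - ei
        numerator += wi * yi * residual
        denominator += wi * ai * residual
    return numerator / denominator


# Toy data (hypothetical): four uncensored individuals and one censored individual.
toy_y = [1.0, 4.0, 2.0, 5.0, None]
toy_a = [0, 1, 0, 1, 1]
toy_c = [0, 0, 0, 0, 1]
toy_w = [1.0] * 5
toy_e = [0.5] * 5
psi_hat = g_estimate_closed_form(toy_y, toy_a, toy_c, toy_w, toy_e)
```

With constant weights and a constant E[A|L] = 0.5, as in this toy input, the estimate reduces to the difference in mean outcome between the treated and the untreated among the uncensored.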

A natural question is whether we can further increase efficiency by replacing $H_i(\psi^{\dagger})$ by a nonlinear function, such
as $[H_i(\psi^{\dagger})]^3$, in the above estimating equation and still preserve consistency of the estimate. Nonlinear functions of
$H_i(\psi^{\dagger})$ cannot be used in our estimating equation for models that, like the structural nested mean models described in
this chapter, impose only mean independence conditional on L, i.e., $E[H(\beta_1) \mid A, L] = E[H(\beta_1) \mid L]$, for identification.
Nonlinear functions of $H_i(\psi^{\dagger})$ can be used for models that impose distributional independence, i.e., $H(\beta_1) \perp\!\!\!\perp A \mid L$, like
structural nested distribution models (not described in this chapter) that map percentiles of the distribution of $Y^{a}$ given
(A = a, L) into percentiles of the distribution of $Y^{a=0}$ given (A = a, L).

The estimator of $\psi$ is consistent only if the models used to estimate $E[A \mid L]$ and $\Pr[C = 1 \mid A, L]$ are both correct.
We can construct a more robust estimator by replacing $H(\psi^{\dagger})$ by $H(\psi^{\dagger}) - E[H(\psi^{\dagger}) \mid L]$ in the estimating equation, and
then estimating the latter conditional expectation by fitting an unweighted linear model for $E[H(\psi^{\dagger}) \mid L] = E[Y^{a=0} \mid L]$
among the uncensored individuals. If this model is correct then the estimate of $\psi$ solving the modified estimating
equation remains consistent even if both the above models for $E[A \mid L]$ and $\Pr[C = 1 \mid A, L]$ are incorrect. Thus we obtain
a consistent estimator of $\psi$ if either (i) the model for $E[H(\psi^{\dagger}) \mid L]$ or (ii) both models for $E[A \mid L]$ and $\Pr[C = 1 \mid A, L]$
are correct, without knowing which of (i) or (ii) is correct. We refer to such an estimator as being doubly robust. Robins
(2000) described a closed form of the doubly robust estimator for the linear structural nested mean model.
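
Written out, the modification described in this Technical Point appears to amount to the following estimating equation. This is a sketch in the notation used above, with hats (an assumption of this sketch) marking quantities replaced by their estimates; it is not a reproduction of the closed form given by Robins (2000).

$$\sum_{i=1}^{n} I[C_i = 0] \, W_i^{C} \, \left( H_i(\psi^{\dagger}) - \hat{E}[H(\psi^{\dagger}) \mid L_i] \right) \left( A_i - \hat{E}[A \mid L_i] \right) = 0$$

The only change relative to the original equation is that $H_i(\psi^{\dagger})$ is centered at its estimated conditional mean given $L_i$.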
Chapter 15
OUTCOME REGRESSION AND PROPENSITY SCORES 


Outcome regression and various versions of propensity score analyses are the most commonly used parametric 
methods for causal inference. You may rightly wonder why it took us so long to include a chapter that discusses 
these methods. So far we have described IP weighting, standardization, and g-estimation—the g-methods. Pre- 
senting the most commonly used methods after the least commonly used ones seems an odd choice on our part. 
Why didn’t we start with the simpler and widely used methods based on outcome regression and propensity scores? 
Because these methods do not work in general. 

More precisely, the simpler outcome regression and propensity score methods—as described in a zillion pub- 
lications that this chapter cannot possibly summarize—work fine in simpler settings, but these methods are not 
designed to handle the complexities associated with causal inference with time-varying treatments. In Part III 
we will again discuss g-methods but will say less about conventional outcome regression and propensity score 
methods. This chapter is devoted to causal methods that are commonly used but have limited applicability for
complex longitudinal data.


15.1 Outcome regression 


Reminder: We defined the average
causal effect as
$E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]$.
We assumed that exchangeability
of the treated and the untreated
was achieved conditional on the
L variables sex, age, race,
education, intensity and duration
of smoking, physical activity in
daily life, recreational exercise,
and weight.


In Chapter 12, we referred to this 
model as a faux marginal structural 
model because it has the form of 
a marginal structural model but IP 
weighting is not required to
estimate its parameters. The stabilized
IP weights $SW^{A}(L)$ are all equal to
1 because the model is conditional
on the entire vector L rather than 
on a subset V of L. 


In the last three chapters we have described IP weighting, standardization, 
and g-estimation to estimate the average causal effect of smoking cessation 
(the treatment) A on weight gain (the outcome) Y. We also described how to 
estimate the average causal effect within subsets of the population, either by 
restricting the analysis to the subset of interest or by adding product terms in 
marginal structural models (Chapter 12) and structural nested models (Chap- 
ter 14). Take structural nested models. These models include parameters for 
the product terms between treatment A and the variables L, but no parame- 
ters for the variables L themselves. This is an attractive property of structural 
nested models because we are interested in the causal effect of A on Y within 
levels of L but not in the (noncausal) relation between L and Y. A method— 
g-estimation of structural nested models—that is agnostic about the functional 
form of the L-Y relation is protected from bias due to misspecifying this rela- 
tion. 

On the other hand, if we were willing to specify the L-Y association within 
levels of A, we would consider the structural model 


$$E[Y^{a,c=0} \mid L] = \beta_0 + \beta_1 a + \beta_2 aL + \beta_3 L$$


where $\beta_2$ and $\beta_3$ are vector parameters. The average causal effects of smoking
cessation A on weight gain Y in each stratum of L are a function of $\beta_1$ and $\beta_2$;
the mean counterfactual outcomes under no treatment in each stratum of L
are a function of $\beta_0$ and $\beta_3$. The parameter $\beta_3$ is usually referred to as the main
effect of L, but the use of the word effect is misleading because $\beta_3$ may not
have an interpretation as the causal effect of L (there may be confounding for
L). The parameter $\beta_3$ simply quantifies how the mean of the counterfactual
$Y^{a=0,c=0}$ varies as a function of L, as we can see in our structural model. See
Fine Point 15.1 for a discussion of parameters that, like $\beta_0$ and $\beta_3$, do not have
a causal interpretation.


Fine Point 15.1


Nuisance parameters. Suppose our goal is to estimate the causal parameters $\beta_1$ and $\beta_2$. If we do so by fitting the
outcome regression model $E[Y^{a,c=0} \mid L] = \beta_0 + \beta_1 a + \beta_2 aL + \beta_3 L$, our estimates of $\beta_1$ and $\beta_2$ will in general be consistent
only if $\beta_0 + \beta_3 L$ correctly models the dependence of the mean $E[Y^{a=0,c=0} \mid L]$ on L. We refer to the parameters $\beta_0$ and
$\beta_3$ as nuisance parameters because they are not our parameters of primary interest.

On the other hand, if we estimate $\beta_1$ and $\beta_2$ by g-estimation of the structural nested model
$E[Y^{a,c=0} - Y^{a=0,c=0} \mid L] = \beta_1 a + \beta_2 aL$, then our estimates of $\beta_1$ and $\beta_2$ will in general be consistent only if the
model for the conditional probability of treatment given L, $\Pr[A = 1 \mid L]$, is correct. That is, the parameters of the
treatment model such as $\text{logit} \Pr[A = 1 \mid L] = \alpha_0 + \alpha_1 L$ are now the nuisance parameters.

For example, bias would arise in the outcome regression model if a covariate L is modeled with a linear term $\beta_3 L$
when it should actually be linear and quadratic $\beta_3 L + \beta_4 L^2$. Structural nested models are not subject to misspecification
of an outcome regression model because the L-Y relation is not specified in the structural model. However, bias would
arise when using g-estimation of structural nested models if the L-A relation is misspecified in the treatment model.
Symmetrically, outcome regression models are not subject to misspecification of a treatment model. For fixed treatments
that do not vary over time, deciding what method to use boils down to deciding which nuisance parameters (those in
the outcome model or those in the treatment model) we believe can be more accurately estimated. When possible, a better
alternative is to use doubly robust methods (see Fine Point 13.2).
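
To make concrete how the stratum-specific effects depend only on the parameters for treatment and for the product terms, here is a minimal sketch. The function name is hypothetical. The inputs 2.6 and 0.05 are the rounded estimates reported later in this section for a model with a single product term (between A and smoking intensity), so the results only approximate the reported effect estimates of 2.8 and 4.4; the gap is presumably due to rounding of the coefficients, which is an assumption of this sketch.

```python
def conditional_effect(beta1, beta2, l):
    """Effect of a = 1 versus a = 0 in the stratum L = l.

    Under E[Y^{a,c=0} | L] = beta0 + beta1*a + beta2*a*L + beta3*L the nuisance
    parameters beta0 and beta3 cancel, leaving beta1 + sum_j beta2[j] * l[j].
    """
    return beta1 + sum(b * x for b, x in zip(beta2, l))


# Rounded coefficients taken from the text; a single effect modifier (cigarettes/day).
effect_5 = conditional_effect(2.6, [0.05], [5])    # about 2.85
effect_40 = conditional_effect(2.6, [0.05], [40])  # about 4.6
```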

The counterfactual mean outcomes if everybody in stratum l of L had been
treated and remained uncensored, $E[Y^{a=1,c=0} \mid L = l]$, are equal to the
corresponding mean outcomes in the uncensored treated, $E[Y \mid A = 1, C = 0, L = l]$,
under exchangeability, positivity, and well-defined interventions. And
analogously for the untreated. Therefore the parameters of the above structural
model can be estimated via ordinary least squares by fitting the outcome
regression model


$$E[Y \mid A, C = 0, L] = \alpha_0 + \alpha_1 A + \alpha_2 AL + \alpha_3 L$$


as described in Section 13.2. Like stratification in Chapter 3, outcome regres- 
sion adjusts for confounding by estimating the causal effect of treatment in 
each stratum of L. If the variables L are sufficient to adjust for confounding 
(and selection bias) and the outcome model is correctly specified, no further 
adjustment is needed. That is, the parameters $\alpha$ of the regression model equal
the parameters $\beta$ of the structural model.

In Section 13.2, outcome regression was an intermediate step towards the
estimation of a standardized outcome mean. Here, outcome regression is the
end of the procedure. Rather than standardizing the estimates of the
conditional means to estimate a marginal mean, we just compare the conditional
mean estimates. In Section 13.2, we fit a regression model with only one
product term in $\beta_2$ (between A and smoking intensity). That is, a model in which
we a priori set most product terms equal to zero. Using the same model as in
Section 13.2, here we obtained the parameter estimates $\hat{\beta}_1 = 2.6$ and $\hat{\beta}_2 = 0.05$.
As an example, the effect estimate $\hat{E}[Y \mid A = 1, C = 0, L] - \hat{E}[Y \mid A = 0, C = 0, L]$
was 2.8 (95% confidence interval: 1.5, 4.1) for those smoking 5 cigarettes/day,
and 4.4 (95% confidence interval: 2.8, 6.1) for 40 cigarettes/day. A common
approach to outcome regression is to assume that there is no effect modification
by any variable in L. Then the model is fit without any product terms and
$\hat{\beta}_1$ is an estimate of both the conditional and marginal average causal effects
of treatment. In our example, a model without any product terms yielded the
estimate 3.5 (95% confidence interval: 2.6, 4.3) kg.

In this chapter we did not need to explain how to fit an outcome regression
model because we had already done it in Chapter 13 when estimating the
components of the parametric g-formula. It is equally straightforward to use
outcome regression for discrete outcomes, e.g., for a dichotomous outcome Y
one could fit a logistic model for $\Pr[Y = 1 \mid A = a, C = 0, L]$.


$\beta_0$ and $\beta_3$ specify the dependence
of $Y^{a=0,c=0}$ on L, which is required
when the model is used to estimate
(i) the mean counterfactual
outcomes and (ii) the conditional
(within levels of L) effect on the
multiplicative rather than additive
scale.


CODE: Program 15.1


15.2 Propensity scores


CODE: Program 15.2

Here we only consider propensity
scores for dichotomous treatments.
Propensity score methods, other
than IP weighting and g-estimation
and other related doubly-robust
estimators, are difficult to generalize
to non-dichotomous treatments.


Figure 15.1


In the study population, due
to sampling variability, the true
propensity score only approximately
“balances” the covariates L. The
estimated propensity score based
on a correct model gives better
balance in general.


When using IP weighting (Chapter 12) and g-estimation (Chapter 14), we
estimated the probability of treatment given the covariates L, $\Pr[A = 1 \mid L]$,
for each individual. Let us refer to this conditional probability as $\pi(L)$. The
value of $\pi(L)$ is close to 0 for individuals who have a low probability of receiving
treatment and is close to 1 for those who have a high probability of receiving
treatment. That is, $\pi(L)$ measures the propensity of individuals to receive
treatment given the information available in the covariates L. No wonder that
$\pi(L)$ is referred to as the propensity score.

In an ideal randomized trial in which half of the individuals are assigned
to treatment A = 1, the propensity score $\pi(L) = 0.5$ for all individuals. Also
note that $\pi(L) = 0.5$ for any choice of L. In contrast, in observational studies
some individuals may be more likely to receive treatment than others.
Because treatment assignment is beyond the control of the investigators, the true
propensity score $\pi(L)$ is unknown, and therefore needs to be estimated from
the data.

In our example, we can estimate the propensity score $\pi(L)$ by fitting a
logistic model for the probability of quitting smoking A conditional on the
covariates L. This is the same model that we used for IP weighting and
g-estimation. Under this model, individual 22941 had the
lowest estimated propensity score (0.053), and individual 24949 the highest
(0.793). Figure 15.1 shows the distribution of the estimated propensity score
in quitters A = 1 (bottom) and nonquitters A = 0 (top). As expected, those
who quit smoking had, on average, a greater estimated probability of quitting
(0.312) than those who did not quit (0.245). If the distribution of $\pi(L)$ were
the same for the treated A = 1 and the untreated A = 0, then there would be
no confounding due to L, i.e., there would be no open path from L to A on a
causal diagram.
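
The estimation step can be illustrated with a deliberately small sketch: a logistic model with a single covariate, fit by Newton-Raphson in plain Python. This is not the model used for the smoking cessation data (which conditions on many covariates and would normally be fit with statistical software); the function names and toy data are hypothetical.

```python
import math


def fit_logistic(x, a, n_iter=25):
    """Fit logit Pr[A = 1 | L] = alpha0 + alpha1 * L by Newton-Raphson.

    x -- values of a single covariate L
    a -- 0/1 treatment indicators
    Returns the pair (alpha0, alpha1).
    """
    alpha0, alpha1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for xi, ai in zip(x, a):
            p = 1.0 / (1.0 + math.exp(-(alpha0 + alpha1 * xi)))
            g0 += ai - p
            g1 += (ai - p) * xi
            v = p * (1.0 - p)
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system (information matrix) * delta = gradient
        alpha0 += (h11 * g0 - h01 * g1) / det
        alpha1 += (h00 * g1 - h01 * g0) / det
    return alpha0, alpha1


def propensity_score(alpha0, alpha1, xi):
    """Estimated Pr[A = 1 | L = xi] under the fitted model."""
    return 1.0 / (1.0 + math.exp(-(alpha0 + alpha1 * xi)))


# Toy data (hypothetical): 1 of 4 individuals with L = 0 is treated, 3 of 4 with L = 1.
toy_l = [0, 0, 0, 0, 1, 1, 1, 1]
toy_a = [0, 0, 0, 1, 0, 1, 1, 1]
a0_hat, a1_hat = fit_logistic(toy_l, toy_a)
ps_l0 = propensity_score(a0_hat, a1_hat, 0)
ps_l1 = propensity_score(a0_hat, a1_hat, 1)
```

Because the toy model is saturated (one binary covariate plus an intercept), the estimated propensity scores equal the observed proportions treated in each level of L, which gives a simple check on the fit.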

Individuals with the same propensity score $\pi(L)$ will generally have different
values of some covariates L. For example, two individuals with $\pi(L) = 0.2$
may differ with respect to smoking intensity and exercise, and yet they may
be equally likely to quit smoking given all the variables in L. That is, both
individuals have the same conditional probability of ending up in the treated
group A = 1. If we consider all individuals with a given value of $\pi(L)$ in the
super-population, this group will include individuals with different values of L
(e.g., different values of smoking intensity and exercise), but the distribution
of L will be the same in the treated and the untreated, that is, $A \perp\!\!\!\perp L \mid \pi(L)$.
We say the propensity score balances the covariates between the treated and
the untreated.

Of course, the propensity score only balances the measured covariates L,
which does not prevent residual confounding by unmeasured factors.
Randomization balances both the measured and the unmeasured covariates, and thus
it is the preferred method to eliminate confounding. See Technical Point 15.1
for a formal definition of a balancing score.


Technical Point 15.1


Balancing scores and prognostic scores. As discussed in the text, the propensity score $\pi(L)$ balances the covariates
between the treated and the untreated. In fact, the propensity score $\pi(L)$ is the simplest example of a balancing score.
More generally, a balancing score b(L) is any function of the covariates L such that $A \perp\!\!\!\perp L \mid b(L)$. That is, for each value
of the balancing score, the distribution of the covariates L is the same in the treated and the untreated. Rosenbaum and
Rubin (1983) proved that exchangeability and positivity based on the variables L implies exchangeability and positivity
based on a balancing score b(L). If it is sufficient to adjust for L, then it is sufficient to adjust for a balancing score
b(L), including the propensity score $\pi(L)$. The causal diagram in Figure 15.2 depicts the propensity score for the setting
represented in Figure 7.1: $\pi(L)$ can be viewed as an intermediate node between L and A with a deterministic arrow
from L to $\pi(L)$. By noting that $\pi(L)$ blocks all backdoor paths from A to L we have given a proof of the sufficiency
of adjusting for $\pi(L)$.

An alternative to a balancing score b(L) is a prognostic score s(L), i.e., a function of the covariates L such that
$Y^{a=0} \perp\!\!\!\perp L \mid s(L)$. Adjustment methods can be developed for both balancing scores and prognostic scores, but methods for
prognostic scores require stronger assumptions and cannot be readily extended to time-varying treatments. See Hansen
(2008) and Abadie et al (2013) for a discussion of prognostic scores.


If L is sufficient to adjust for
confounding and selection bias, then
$\pi(L)$ is sufficient too. This result
was derived by Rosenbaum and
Rubin in a seminal paper published in
1983.


In a randomized experiment, the
estimated $\pi(L)$ adjusts for both
systematic and random imbalances in
covariates, and thus does better
than adjustment for the true $\pi(L)$
which ignores random imbalances.


Like all methods for causal inference that we have discussed, the use of
propensity score methods requires the identifying conditions of exchangeability,
positivity, and consistency. The use of propensity score methods is justified by
the following key result: Exchangeability of the treated and the untreated
within levels of the covariates L implies exchangeability within levels of the
propensity score $\pi(L)$. That is, conditional exchangeability $Y^{a} \perp\!\!\!\perp A \mid L$ implies
$Y^{a} \perp\!\!\!\perp A \mid \pi(L)$. Further, positivity within levels of the propensity score $\pi(L)$
(which means that no individual has a propensity score equal to either 1 or
0) holds if and only if positivity within levels of the covariates L, as defined
in Chapter 2, holds.

Under exchangeability and positivity within levels of the propensity score
$\pi(L)$, the propensity score can be used to estimate causal effects using
stratification (including outcome regression), standardization, and matching. The
next two sections describe how to implement each of these methods. As a first
step, we must estimate the propensity score $\pi(L)$ from the observational
data, and then proceed to use the estimated propensity score in lieu
of the covariates L for stratification, standardization, or matching.


15.3 Propensity stratification and standardization 


Figure 15.2 (causal diagram: $L \to \pi(L) \to A \to Y$)


CODE: Program 15.3


Caution: the denominator of the
IP weights for a dichotomous
treatment A is not the propensity score
$\pi(L)$, but a function of $\pi(L)$. The
denominator is $\pi(L)$ for the treated
(A = 1) and $1 - \pi(L)$ for the
untreated (A = 0).


Though the propensity score is
one-dimensional, we still need to
estimate it from a model that regresses
treatment on a high-dimensional L.
The same applies to IP weighting
and g-estimation.


CODE: Program 15.4


The average causal effect among individuals with a particular value s of the
propensity score $\pi(L)$, i.e., $E[Y^{a=1,c=0} \mid \pi(L) = s] - E[Y^{a=0,c=0} \mid \pi(L) = s]$, is
equal to $E[Y \mid A = 1, C = 0, \pi(L) = s] - E[Y \mid A = 0, C = 0, \pi(L) = s]$ under the
identifying conditions. This conditional effect might be estimated by
restricting the analysis to individuals with the value s of the true propensity score.
However, the propensity score $\pi(L)$ is generally a continuous variable that can
take any value between 0 and 1. It is therefore unlikely that two individuals
will have exactly the same value s. For example, only individual 22005 had
an estimated $\pi(L)$ of 0.6563, which means that we cannot estimate the causal
effect among individuals with $\pi(L) = 0.6563$ by comparing the treated and the
untreated with that particular value.

One approach to deal with the continuous propensity score is to create
strata that contain individuals with similar, but not identical, values of $\pi(L)$.
Deciles of the estimated $\pi(L)$ are a popular choice: individuals in the
population are classified in 10 strata of approximately equal size, then the causal
effect is estimated in each of the strata. In our example, each decile contained
approximately 162 individuals. The effect of smoking cessation on weight gain
ranged across deciles from 0.0 to 6.6 kg, but the 95% confidence intervals
around these point estimates were wide.
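
The stratification step can be sketched as follows. This minimal illustration assumes that the propensity scores have already been estimated and that only uncensored individuals are passed in; the function name and toy data are hypothetical, and no confidence intervals are computed.

```python
def stratum_effects(ps, a, y, n_strata=10):
    """Difference in mean outcome (treated minus untreated) within quantile strata.

    ps -- estimated propensity scores
    a  -- 0/1 treatment indicators
    y  -- outcomes
    Returns one entry per stratum; None if a stratum lacks treated or untreated.
    """
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    n = len(order)
    effects = []
    for k in range(n_strata):
        # Stratum k holds roughly n / n_strata individuals with adjacent scores
        members = order[k * n // n_strata:(k + 1) * n // n_strata]
        treated = [y[i] for i in members if a[i] == 1]
        untreated = [y[i] for i in members if a[i] == 0]
        if treated and untreated:
            effects.append(sum(treated) / len(treated) - sum(untreated) / len(untreated))
        else:
            effects.append(None)
    return effects


# Toy data (hypothetical): 20 individuals, one treated and one untreated per decile,
# with an outcome that depends on the score and a constant treatment effect of 3.
toy_ps = [0.05 + 0.1 * (i // 2) for i in range(20)]
toy_a = [i % 2 for i in range(20)]
toy_y = [10 * p + 3 * t for p, t in zip(toy_ps, toy_a)]
toy_effects = stratum_effects(toy_ps, toy_a, toy_y)
```

In the toy data the two members of each stratum share exactly the same score, so every stratum-specific difference equals 3 up to floating-point error. With real data the scores within a decile differ, which is the source of the problem discussed later in this section.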

We could have also obtained these effect estimates by fitting an outcome
regression model for $E[Y \mid A, C = 0, \pi(L)]$ that included as covariates treatment
A, 9 indicators for the deciles of the estimated $\pi(L)$ (one of the deciles is the
reference level and is already incorporated in the intercept of the model), and
9 product terms between A and the indicators. Most applications of outcome
regression with deciles of the estimated $\pi(L)$ do not include the product terms,
i.e., they assume no effect modification by $\pi(L)$. In our example, a model
without product terms yields an effect estimate of 3.5 kg (95% confidence
interval: 2.6, 4.4). See Fine Point 15.2 for more on effect modification by the
propensity score.

Stratification on deciles or other functions of the propensity score raises a
potential problem: in general the distribution of the continuous $\pi(L)$ will differ
between the treated and the untreated within some strata (e.g., deciles). If, for
example, the average $\pi(L)$ were greater in the treated than in the untreated
in some strata, then the treated and the untreated might not be exchangeable
in those strata. This problem did not arise in previous chapters, when we
used functions of the propensity score to estimate the parameters of structural
models via IP weighting and g-estimation, because those methods used the
numerical value of the estimated probability rather than a categorical
transformation like deciles. Similarly, the problem does not arise when using outcome
regression for $E[Y \mid A, C = 0, \pi(L)]$ with the estimated propensity score $\pi(L)$ as
a continuous covariate rather than as a set of indicators. When we used this
latter approach in our example the effect estimate was 3.6 (95% confidence
interval: 2.7, 4.5) kg.

The validity of our inference depends on the correct specification of the
relationship between $\pi(L)$ and the mean outcome Y (which we assumed to be
linear). However, because the propensity score is a one-dimensional summary
of the multi-dimensional L, it is easy to guard against misspecification of this
relationship by fitting flexible models, e.g., cubic splines rather than a single
linear term for the propensity score. Note that IP weighting and g-estimation
were agnostic about the relationship between propensity score and outcome.

When our parametric assumptions for $E[Y \mid A, C = 0, \pi(L)]$ are correct,
and exchangeability and positivity hold, the model estimates the average
causal effects within all levels s of the propensity score,
$E[Y^{a=1,c=0} \mid \pi(L) = s] - E[Y^{a=0,c=0} \mid \pi(L) = s]$. If we were interested in the
average causal effect in the entire study population, $E[Y^{a=1,c=0}] - E[Y^{a=0,c=0}]$,
we would standardize the conditional means $E[Y \mid A, C = 0, \pi(L)]$ by using the
distribution of the propensity score. The procedure is the same one described
in Chapter 13 for continuous variables, except that we replace the variables L
by the estimated $\pi(L)$. Note that the procedure can naturally incorporate a
product term between treatment A and the estimated $\pi(L)$ in the outcome model.
In our example, the standardized effect estimate was 3.6 (95% confidence interval:
2.7, 4.6) kg.
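
The standardization step can be sketched as an average of predicted differences over the observed distribution of the estimated propensity score. The outcome model below is a made-up example with a product term; its coefficients and the function names are hypothetical and are not estimates from the smoking cessation data.

```python
def standardized_effect(ps, predict):
    """Average of predict(1, s) - predict(0, s) over the observed propensity scores.

    ps      -- estimated propensity scores of the individuals in the study population
    predict -- function (a, s) -> predicted mean outcome from a fitted outcome model
    """
    differences = [predict(1, s) - predict(0, s) for s in ps]
    return sum(differences) / len(differences)


def toy_outcome_model(a, s):
    """Hypothetical fitted model with a product term between A and the propensity score."""
    return 1.0 + 2.0 * a + 4.0 * s + 3.0 * a * s


toy_scores = [0.1, 0.2, 0.3, 0.4]
toy_standardized = standardized_effect(toy_scores, toy_outcome_model)
```

For this toy model the conditional effect at score s is 2 + 3s, so the standardized effect is 2 + 3 times the mean score (0.25), i.e., 2.75. Without a product term, the standardized effect would simply equal the coefficient for treatment.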
15.4 Propensity matching


After propensity matching, the
matched population has the $\pi(L)$
distribution of the treated, of the
untreated, or any other arbitrary
distribution.


A drawback of matching used to be 
that nobody knew how to compute 
the variance of the effect estimate. 
That is no longer the case thanks 
to the work of Abadie and Imbens 
(2006). 


Remember: positivity is now defined
within levels of the propensity
score, i.e., $\Pr[A = a \mid \pi(L) = s] > 0$
for all s such that $\Pr[\pi(L) = s]$
is nonzero.




The process of matching on the propensity score $\pi(L)$ is analogous to
matching on a single continuous variable L, a procedure described in Chapter 4.
There are many forms of propensity matching. All of them attempt to form
a matched population in which the treated and the untreated are exchangeable
because they have the same distribution of $\pi(L)$. For example, one can
match the untreated to the treated: each treated individual is paired with one
(or more) untreated individuals with the same propensity score value. The
subset of the original population composed of the treated-untreated pairs (or
sets) is the matched population. Under exchangeability and positivity given
$\pi(L)$, association measures in the matched population are consistent estimates
of effect measures, e.g., the associational risk ratio in the matched population
consistently estimates the causal risk ratio in the matched population.

Again, it is unlikely that two individuals will have exactly the same
values of the propensity score $\pi(L)$. In our example, propensity score matching
will be carried out by identifying, for each treated individual, one (or more)
untreated individuals with a close value of $\pi(L)$. A common approach is to
match treated individuals with a value s of the estimated $\pi(L)$ with untreated
individuals who have a value $s \pm 0.05$, or some other small difference. For
example, treated individual 1089 (estimated $\pi(L)$ of 0.6563) might be matched
with untreated individual 1088 (estimated $\pi(L)$ of 0.6579). There are numerous
ways of defining closeness, and a detailed description of these definitions
is beyond the scope of this book.
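
One of the many possible definitions of closeness can be sketched as greedy 1:1 nearest-neighbor matching without replacement within a caliper. The choice of a greedy algorithm, the function name, and most of the toy values are assumptions of this sketch (only the scores 0.6563 and 0.6579 come from the text); it is not presented as the procedure used for the smoking cessation example.

```python
def caliper_match(ps_treated, ps_untreated, caliper=0.05):
    """Greedy 1:1 matching without replacement on the estimated propensity score.

    Returns a list of (treated_index, untreated_index) pairs. Treated individuals
    with no available untreated individual within the caliper remain unmatched.
    """
    available = set(range(len(ps_untreated)))
    pairs = []
    for t, score in enumerate(ps_treated):
        # Closest untreated individual that has not been used yet
        best = min(available, key=lambda u: abs(ps_untreated[u] - score), default=None)
        if best is not None and abs(ps_untreated[best] - score) <= caliper:
            pairs.append((t, best))
            available.remove(best)
    return pairs


# Toy example: three treated and three untreated individuals.
toy_pairs = caliper_match([0.6563, 0.30, 0.90], [0.31, 0.6579, 0.50])
```

Here the treated individual with score 0.6563 is paired with the untreated individual with score 0.6579, the second treated individual is paired with the untreated individual with score 0.31, and the third treated individual (0.90) has no untreated individual within 0.05 and is therefore left out of the matched population.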

Defining closeness in propensity matching entails a bias-variance
trade-off. If the closeness criteria are too loose, individuals with relatively different
values of $\pi(L)$ will be matched to each other, the distribution of $\pi(L)$ will
differ between the treated and the untreated in the matched population, and
exchangeability will not hold. On the other hand, if the closeness criteria are
too tight and many individuals are excluded by the matching procedure, there
will be approximate exchangeability but the effect estimate may have wider
95% confidence intervals.

The definition of closeness is also related to that of positivity. In our
smoking cessation example, the distributions of the estimated $\pi(L)$ in the treated
and the untreated overlapped throughout most of the range (see Figure 15.1).
Only 2 treated individuals (about 0.1% of the study population) had values greater
than those of any untreated individual. When using outcome regression on the
estimated $\pi(L)$ in the previous section, we effectively assumed that the lack
of untreated individuals with high $\pi(L)$ estimates was due to chance (random
nonpositivity) and thus included all individuals in the analysis. In contrast,
most propensity matched analyses would not consider those two treated
individuals close enough to any of the untreated individuals, and would exclude
them. Matching does not distinguish between random and structural
nonpositivity.
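
A simple way to flag treated individuals who fall above the range of the untreated, like the two individuals mentioned above, is sketched below. The function name and the toy values are hypothetical.

```python
def treated_above_untreated_range(ps, a):
    """Indices of treated individuals whose estimated propensity score
    exceeds that of every untreated individual."""
    max_untreated = max(score for score, treated in zip(ps, a) if treated == 0)
    return [i for i, (score, treated) in enumerate(zip(ps, a))
            if treated == 1 and score > max_untreated]


# Toy example (hypothetical scores): the last two treated individuals
# have scores above the largest untreated score (0.67).
toy_scores = [0.10, 0.25, 0.60, 0.67, 0.70, 0.79]
toy_treatment = [0, 1, 0, 0, 1, 1]
outside_range = treated_above_untreated_range(toy_scores, toy_treatment)
```

A check of this kind only describes the observed overlap; it cannot say whether the lack of overlap is random or structural.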

The above discussion illustrates how the matched population may be very
different from the target (super)population. In theory, propensity matching
can be used to estimate the causal effect in a well characterized target
population. For example, when matching each treated individual with one or
more untreated individuals and excluding the unmatched untreated, one is
estimating the effect in the treated (see Fine Point 15.2). In practice, however,
propensity matching may yield an effect estimate in a hard-to-describe subset
of the study population. For example, under a given definition of closeness,
some treated individuals cannot be matched with any untreated individuals
and thus they are excluded from the analysis. As a result, the effect estimate
corresponds to a subset of the population that is defined by the values of the
estimated propensity score that have successful matches.


Even if every subject came with
her propensity score tattooed on
her forehead, the population could
still be ill-characterized because the
same propensity score value may
mean different things in different
settings.

That propensity matching forces investigators to restrict the analysis to 
treatment groups with overlapping distributions of the estimated propensity 
score is often presented as a strength of the method. One surely would not want 
to have biased estimates because of violations of positivity, right? However, 
leaving aside issues related to random variability (see above), there is a price 
to be paid for restrictions based on the propensity score. Suppose that, after 
inspecting Figure 15.1, we conclude that we can only estimate the effect of 
smoking cessation for individuals with an estimated propensity score less than 
0.67. Who are these people? It is unclear because individuals do not come with 
a propensity score tattooed on their forehead. Because the matched population 
is not well characterized, it is hard to assess the transportability of the effect 
estimate to other populations. 

When positivity concerns arise, restriction based on real-world variables 
(e.g., age, number of cigarettes) leads to a more natural characterization of the 
causal effect. In our smoking cessation example, the two treated individuals
with estimated $\pi(L) > 0.67$ were the only ones in the study who were over
age 50 and had smoked for less than 10 years. We could exclude them and
explain that our effect estimate only applies to smokers under age 50 and to
smokers 50 and over who had smoked for at least 10 years. This way of defining
the target population is more natural than defining it as those with estimated
$\pi(L) < 0.67$.

Using propensity scores to detect the overlapping range of the treated and 
the untreated may be useful, but simply restricting the study population to 
that range is a lazy way to ensure positivity. The automatic positivity ensured 
by propensity matching needs to be weighed against the difficulty of assessing 
transportability when restriction is solely based on the value of the estimated 
propensity scores. 


15.5 Propensity models, structural models, predictive models 


In Part II of this book we have described two different types of models for causal 
inference: propensity models and structural models. Let us now compare them. 
Propensity models are models for the probability of treatment A given 
the variables L used to try to achieve conditional exchangeability. We have 
used propensity models for matching and stratification in this chapter, for IP 
weighting in Chapter 12, and for g-estimation in Chapter 14. The parameters 
of propensity models are nuisance parameters (see Fine Point 15.1) without a 
causal interpretation because a variable L and treatment A may be associated 
for many reasons—not only because the variable L causes A. For example, 
the association between L and A can be interpreted as the effect of L on A 
under Figure 7.1, but not under Figure 7.2. Yet propensity models are useful 
for causal inference, often as the basis of the estimation of the parameters of 
structural models, as we have described in this and previous chapters. 
Structural models describe the relation between the treatment A and some 
component of the distribution (e.g., the mean) of the counterfactual outcome 
$Y^a$, either marginally or within levels of the variables L. For continuous 
treatments, a structural model is often referred to as a dose-response model. The 
parameters for treatment in structural models are not nuisance parameters: 
Fine Point 15.2 


Effect modification and the propensity score. A reason why matched and unmatched estimates may differ is effect 
modification. As an example, consider the common setting in which the number of untreated individuals is much 
larger than the number of treated individuals. Propensity matching often results in almost all treated individuals being 
matched and many untreated individuals being unmatched and therefore excluded from the analysis. When this occurs, 
the distribution of causal effect modifiers in the matched population will resemble that in the treated. Therefore, the 
effect in the matched population will be closer to the effect in the treated than to the effect that would have been 
estimated by methods that use data from the entire population. See Technical Point 4.1 for alternative ways to estimate 
the effect of treatment in the treated via IP weighting and standardization. 

Effect modification across propensity strata may be interpreted as evidence that decision makers know what they 
are doing, e.g. that doctors tend to treat patients who are more likely to benefit from treatment (Kurth et al 2006). 
However, the presence of effect modification by $\pi(L)$ may complicate the interpretation of the estimates. Consider a 
situation with qualitative effect modification: “Doctor, according to our study, this drug is beneficial for patients who 
have a propensity score between 0.11 and 0.93 when they arrive at your office, but it may kill those with propensity 
scores below 0.11,” or “Ms. Minister, let's apply this educational intervention to children with propensity scores below 
0.57 only.” The above statements are of little policy relevance because, as discussed in the main text, they are not 
expressed in terms of the measured variables L. 

Finally, besides effect modification, there are other reasons why matched estimates may differ from the overall effect 
estimate: violations of positivity in the non-matched, an unmeasured confounder that is more/less prevalent (or that is 
better/worse measured) in the matched population than in the unmatched population, etc. As discussed for individual 
variables L in Chapter 4, remember that effect modification might be explained by differences in residual confounding 
across propensity strata. 


they have a direct causal interpretation as outcome differences under differ- 
ent treatment values a. We have described two classes of structural models: 
marginal structural models and structural nested models. Marginal structural 
models include parameters for treatment, for the variables V that may be ef- 
fect modifiers, and for product terms between treatment and variables V. The 
choice of V reflects only the investigator’s substantive interest in effect mod- 
ification (see Section 12.5). If no covariates V are included, then the model 
is truly marginal. If all variables L are included as possible effect modifiers, 


See Fine Point 14.1 for a discussion 
of the relation between structural 
nested models and faux semipara- 
metric marginal structural models, 
and other subtleties. 


A study found that Facebook Likes 
predict sexual orientation, politi- 
cal views, and personality traits 
(Kosinski et al, 2013). Low in- 
telligence was predicted by, among 
other things, a “Harley Davidson” 
Like. This is purely predictive, not 
necessarily causal. 


then the marginal structural model becomes a faux marginal structural model. 
Structural nested models include parameters for treatment and for product 
terms between treatment A and all variables in L that are effect modifiers. 


We have presented outcome regression as a method to estimate the para- 
meters of faux marginal structural models for causal inference. However, out- 
come regression is also widely used for purely predictive, as opposed to causal, 
purposes. For example, online retailers use sophisticated outcome regression 
models to predict which customers are more likely to purchase their products. 
The goal is not to determine whether your age, sex, income, geographic origin, 
and previous purchases have a causal effect on your current purchase. Rather, 
the goal is to identify those customers who are more likely to make a purchase 
so that specific marketing programs can be targeted to them. It is all about 
association, not causation. Similarly, doctors use algorithms based on outcome 
regression to identify patients at high risk of developing a serious disease or 
dying. The parameters of these predictive models do not necessarily have any 
causal interpretation and all covariates in the model have the same status, i.e., 
there is no distinction between a treatment variable A and variables L. 


It is not uncommon for propensity 
analyses to report measures of 
predictive power like Mallows’s Cp. 
The relevance of these measures for 
causal inference is questionable. 


If we perfectly predicted treatment, 
then all treated individuals would 
have $\pi(L) = 1$ and all untreated 
individuals would have $\pi(L) = 0$. 
There would be no overlap and the 
analysis would be impossible. 


The dual use of outcome regression in both causal inference methods and 
in prediction has led to many misunderstandings. One of the most important 
misunderstandings has to do with variable selection procedures. When the 
interest lies exclusively in outcome prediction, investigators may want to select 
any variables that, when included as covariates in the model, improve its 
predictive ability. Many well-known variable selection procedures (e.g., forward 
selection, backward elimination, stepwise selection) and more recent 
developments in machine learning are used to enhance prediction. These are powerful 
tools for investigators who are interested in prediction, especially when dealing 
with very high-dimensional data. 

Unfortunately, statistics courses and textbooks have not always made a 
sharp difference between causal inference and prediction. As a result, these 
variable selection procedures for predictive models have often been applied to 
causal inference models. A possible result of this mismatch is the inclusion of 
superfluous—or even harmful—covariates in propensity models and structural 
models. Specifically, the application of predictive algorithms to causal inference 
models may result in inflated variances. 

The problem arises because of the widespread, but mistaken, belief that 
propensity models should predict treatment A as well as possible. Propensity 
models do not need to predict treatment very well. They just need to include 
the variables L that guarantee exchangeability. Covariates that are strongly 
associated with treatment, but are not necessary to guarantee exchangeability, 
do not help reduce bias. If these covariates are included in L, adjustment can 
actually result in estimates with very large variances. 

Consider the following example. Suppose all individuals in a certain study 
attend either hospital Aceso or hospital Panacea. Doctors in hospital Aceso 
give treatment A = 1 to 99% of the individuals, and those in hospital Panacea 
give A = 0 to 99% of the individuals. Suppose the variable Hospital has 
no effect on the outcome (except through its effect on treatment A) and is 
therefore not necessary to achieve conditional exchangeability. Say we decide 
to add Hospital as a covariate in our propensity model anyway. The propensity 
score $\pi(L)$ in the target population is about 0.99 for individuals in hospital 
Aceso and 0.01 for those in hospital Panacea, but by chance we may end up 
with a study population in which everybody in hospital Aceso has A = 1 or 
everybody in hospital Panacea has A = 0 for some strata defined by L. That 
is, our effect estimate may have a near-infinite variance without any reduction 
in confounding. That treatment is now very well predicted is irrelevant for 
causal inference purposes. 
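
The variance inflation described in the hospital example can be illustrated with a small simulation. The sketch below is not taken from the book's programs: the sample size, the treatment effect of 2, the absence of any confounder, and the function names are all assumptions chosen for illustration. It compares an estimate that ignores Hospital with an IP weighted estimate whose propensity model includes Hospital.

```python
import numpy as np

def simulate_once(rng, n=2000):
    """One hypothetical study; Hospital affects Y only through treatment A."""
    hospital = rng.integers(0, 2, size=n)            # 1 = Aceso, 0 = Panacea
    p_treat = np.where(hospital == 1, 0.99, 0.01)    # treatment almost determined by hospital
    a = rng.binomial(1, p_treat)
    y = 2.0 * a + rng.normal(size=n)                 # assumed true effect: 2

    # (1) Propensity model without Hospital: a single marginal probability,
    #     so the weighted contrast reduces to a difference in means.
    est_without = y[a == 1].mean() - y[a == 0].mean()

    # (2) Saturated propensity model with Hospital, estimated within hospital.
    pi_hat = np.where(hospital == 1,
                      a[hospital == 1].mean(),
                      a[hospital == 0].mean())
    if pi_hat.min() == 0.0 or pi_hat.max() == 1.0:
        return est_without, np.nan                   # positivity fails in this sample
    w = np.where(a == 1, 1.0 / pi_hat, 1.0 / (1.0 - pi_hat))
    est_with = (np.average(y[a == 1], weights=w[a == 1])
                - np.average(y[a == 0], weights=w[a == 0]))
    return est_without, est_with

def compare_variability(n_sims=200, seed=0):
    """Standard deviation of each estimator across simulated studies."""
    rng = np.random.default_rng(seed)
    results = np.array([simulate_once(rng) for _ in range(n_sims)])
    return np.nanstd(results[:, 0]), np.nanstd(results[:, 1])

sd_without, sd_with = compare_variability()
print(f"SD without Hospital: {sd_without:.3f}; SD with Hospital: {sd_with:.3f}")
```

Under these assumptions both estimators target the same effect, so adding Hospital buys no reduction in confounding; it only concentrates the weights on the few individuals whose treatment is atypical for their hospital.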

Besides variance inflation, a predictive attitude towards variable selection 
for causal inference models—both propensity models and outcome regression 
models—may also result in self-inflicted bias. For example, the inclusion of 
colliders as covariates may result in systematic bias even if colliders may be 
effective covariates for purely predictive purposes. We will return to these 
issues in Chapter 18. 

All causal inference methods based on models—propensity models and 
structural models—require no misspecification of the functional form for the 
covariates. To reduce the possibility of model misspecification, we use flexible 
specifications, e.g., cubic splines rather than linear terms. In addition, these 
causal inference methods require the conditions of exchangeability, positivity, 
and well-defined interventions for unbiased causal inferences. In the next chap- 
ter we describe a very different type of causal inference method that does not 
require exchangeability as we know it. 
Chapter 16 


INSTRUMENTAL VARIABLE ESTIMATION 


The causal inference methods described so far in this book rely on a key untestable assumption: all variables needed 
to adjust for confounding and selection bias have been identified and correctly measured. If this assumption is 
incorrect—and it will always be to a certain extent—there will be residual bias in our causal estimates. 

It turns out that there exist other methods that can validly estimate causal effects under an alternative set 
of assumptions that do not require measuring all adjustment factors. Instrumental variable estimation is one of 
those methods. Economists and other social scientists reading this book can breathe now. We are finally going to 
describe a very common method in their fields, a method that is unlike any other we have discussed so far. 


16.1 The three instrumental conditions 
Figure 16.1 


Figure 16.2 


Condition (ii) would not be guar- 
anteed if, for example, partici- 
pants were inadvertently unblinded 
by side effects of treatment. 


The causal diagram in Figure 16.1 depicts the structure of a double-blind 
randomized trial with noncompliance: Z is the randomization assignment in- 
dicator (1: treatment, 0: placebo), A is an indicator for receiving treatment (1: 
yes, 0: no), Y the outcome, and U all factors (some unmeasured) that affect 
both the outcome and the adherence to the assigned treatment. 

Suppose we want to consistently estimate the average causal effect of A on 
Y. Whether we use IP weighting, standardization, g-estimation, stratification, 
or matching, we need to correctly measure, and adjust for, variables that block 
the backdoor path $A \leftarrow U \rightarrow Y$, i.e., we need to ensure conditional 
exchangeability of the treated and the untreated. Unfortunately, all these methods will 
result in biased effect estimates if some of the necessary variables are unmea- 
sured, imperfectly measured, or misspecified in the model. 

Instrumental variable (IV) methods are different: they may be used to 
identify the average causal effect of A on Y in this randomized trial, even if we 
did not measure the variables normally required to adjust for the confound- 
ing caused by U. To perform their magic, IV methods need an instrumental 
variable Z, or an instrument. A variable Z is an instrument because it meets 
three instrumental conditions: 

(i) Z is associated with A 

(ii) Z does not affect Y except through its potential effect on A 

(iii) Z and Y do not share causes 

See Technical Point 16.1 for a more rigorous definition of these conditions. 

In the double-blind randomized trial described above, the randomization 
indicator Z is an instrument. Condition (i) is met because trial participants are 
more likely to receive treatment if they were assigned to treatment, condition 
(ii) is expected by the double-blind design, and condition (iii) is expected by 
the random assignment of Z. 

Figure 16.1 depicts a special case in which the instrument Z has a causal 
effect on the treatment A. We then refer to Z as a causal instrument. Other 
instruments do not have a causal effect on treatment A. The variable Z in 
Figure 16.2 also meets the instrumental conditions with the Z-A association 
(i) now resulting from the cause $U_Z$ shared by Z and A. We then refer to 
Technical Point 16.1 


The instrumental conditions, formally. Instrumental condition (i), sometimes referred to as the relevance condition, 
is non-null association between Z and A, or $Z \perp\!\!\!\perp A$ does not hold. 

Instrumental condition (ii), commonly known as the exclusion restriction, is the condition of “no direct effect of 
Z on Y.” At the individual level, condition (ii) is $Y_i^{z,a} = Y_i^{z',a} = Y_i^{a}$ for all $z$, $z'$, all $a$, all individuals $i$. However, 
for some results presented in this chapter, only the population level condition (ii) is needed, i.e., $E[Y^{z,a}] = E[Y^{z',a}]$. 
Both versions of condition (ii) are trivially true for proxy instruments. 

Instrumental condition (iii) can be written as marginal exchangeability $Y^{z,a} \perp\!\!\!\perp Z$ for all $a, z$, which holds in the 
SWIGs corresponding to Figures 16.1, 16.2, and 16.3. Together with condition (ii) at the individual level, it implies 
$Y^{a} \perp\!\!\!\perp Z$. A stronger condition (iii) is joint exchangeability, or $\{Y^{z,a};\ a \in [0, 1],\ z \in [0, 1]\} \perp\!\!\!\perp Z$ for dichotomous treatment 
and instrument. See Technical Point 2.1 for a discussion on different types of exchangeability and Technical Point 16.2 
for a description of results that require each version of exchangeability. Both versions of condition (iii) are expected to 
hold in randomized experiments in which the instrument Z is the randomized assignment. 





$U_Z$ as the unmeasured causal instrument and to Z as the measured surrogate 
or proxy instrument. (That Z and Y have $U_Z$ as a common cause does not 
violate condition (iii) because $U_Z$ is a causal instrument; see Technical Point 
16.1.) Figure 16.3 depicts another case of proxy instrument Z in a selected 
population: the Z-A association arises from conditioning on a common effect 
S of the unmeasured causal instrument $U_Z$ and the proxy instrument Z. Both 
causal and proxy instruments can be used for IV estimation, with some caveats 
described in Section 16.4. 

Figure 16.3 

In previous chapters we have estimated the effect of smoking cessation on 
weight change using various causal inference methods applied to observational 
data. To estimate this effect using IV methods, we need an instrument Z. 


Since there is no randomization indicator in an observational study, consider 
the following candidate for an instrument: the price of cigarettes. It can 
be argued that this variable meets the three instrumental conditions if (i) 
cigarette price affects the decision to quit smoking, (ii) cigarette price affects 
weight change only through its effect on smoking cessation, and (iii) no common 
causes of cigarette price and weight change exist. Fine Point 16.1 reviews some 
proposed instruments in observational studies. 

Condition (i) is met if the candidate 
instrument Z “price in state 
of birth” is associated with smoking 
cessation A through the unmeasured 
variable $U_Z$ “price in place of 
residence”. 

To fix ideas, let us propose an instrument Z that takes value 1 when the 
average price of a pack of cigarettes in the U.S. state where the individual was 
born was greater than $1.50, and takes value 0 otherwise. Unfortunately, we 
cannot determine whether our variable Z is actually an instrument. Of the 
three instrumental conditions, only condition (i) is empirically verifiable. To 
verify this condition we need to confirm that the proposed instrument Z and the 
treatment A are associated, i.e., that Pr [A = 1|Z = 1] - Pr [A = 1|Z = 0] > 
0. The probability of quitting smoking is 25.8% among those with Z = 1 
and 19.5% among those with Z = 0; the risk difference Pr [A = 1|Z = 1] - 
Pr [A = 1|Z = 0] is therefore about 6%. When, as in this case, Z and A are weakly 
associated, Z is often referred to as a weak instrument (more on weak instruments 
in Section 16.5). 
On the other hand, conditions (ii) and (iii) cannot be empirically verified. 
To verify condition (ii), we would need to prove that Z can only cause the 
outcome Y through the treatment A. We cannot prove it by conditioning on 
A, which is a collider on the pathway $Z \leftarrow U_Z \rightarrow A \leftarrow U \rightarrow Y$, because 
that would induce an association between Z and Y even if condition (ii) held 
Fine Point 16.1 


Candidate instruments in observational studies. Many variables have been proposed as instruments in observational 
studies and it is not possible to review all of them here. Three commonly used categories of candidate instruments are 


• Genetic factors: The proposed instrument is a genetic variant Z that is associated with treatment A and that, 
supposedly, is only related to the outcome Y through A. For example, when estimating the effects of alcohol 
intake on the risk of coronary heart disease, Z can be a polymorphism associated with alcohol metabolism (say, 
ALDH2 in Asian populations). Causal inference from observational data via IV estimation using genetic variants is 
part of the framework known as Mendelian randomization (Katan 1986, Davey Smith and Ebrahim 2004, Didelez 
and Sheehan 2007, VanderWeele et al. 2014). 


• Preference: The proposed instrument Z is a measure of the physician’s (or a care provider's) preference for one 
treatment over the other. The idea is that a physician’s preference influences the prescribed treatment A without 
having a direct effect on the outcome Y. For example, when estimating the effect of prescribing COX-2 selective 
versus non-selective nonsteroidal anti-inflammatory drugs on gastrointestinal bleeding, $U_Z$ can be the physician’s 
prescribing preference for drug class (COX-2 selective or non-selective). Because $U_Z$ is unmeasured, investigators 
replace it in the analysis by a (measured) surrogate instrument Z, such as “last prescription issued by the physician 
before current prescription” (Korn and Baumrind 1998, Earle et al. 2001, Brookhart and Schneeweiss 2007). 


• Access: The proposed instrument Z is a measure of access to the treatment. The idea is that access impacts the 
use of treatment A but does not directly affect the outcome Y. For example, physical distance or travel time to 
a facility has been proposed as an instrument for treatments available at such facilities (McClellan et al. 1994, 
Card 1995, Baiocchi et al. 2010). Another example: calendar period has been proposed as an instrument for a 
treatment whose accessibility varies over time (Hoover et al. 1994, Detels et al. 1998). In the main text we use 
“price of the treatment”, another measure of access, as a candidate instrument. 





true. And we cannot, of course, prove that condition (iii) holds because we 
can never rule out confounding for the effect of any variable. We can only 
assume that conditions (ii) and (iii) hold. IV estimation, like all methods we 
have studied so far, is based on untestable assumptions. 


Conditions (ii) and (iii) can sometimes 
be empirically falsified by using 
data on instrument, treatment, 
and outcome. However, falsification 
tests only reject the conditions 
for a small subset of violations. For 
most violations, the test has no statistical 
power, even for an arbitrarily 
large sample size (Bonet 2001, 
Glymour et al. 2012). 

In observational studies we cannot prove that our proposed instrument Z 
is truly an instrument. We refer to Z as a proposed or candidate instrument 
because we can never guarantee that the structures represented in Figures 16.1 
and 16.2 are the ones that actually occur. The best we can do is to use subject- 
matter knowledge to build a case for why the proposed instrument Z may be 
reasonably assumed to meet conditions (ii) and (iii); this is similar to how 
we use subject-matter knowledge to justify the identifying assumptions of the 
methods described in previous chapters. 

But let us provisionally assume that Z is an instrument. Now what? Can 
we now see the magic of IV estimation in action? Can we consistently estimate 
the average causal effect of A on Y without having to identify and measure 
the confounders? Sadly, the answer is no without further assumptions. An 
instrument by itself does not allow us to identify the average causal effect of 
smoking cessation A on weight change Y, but only identifies certain upper and 
lower bounds. Typically, the bounds are very wide and often include the null 
value (see Technical Point 16.2). 

In our example, these bounds are not very helpful. They would only confirm 
what we already knew: smoking cessation can result in weight gain, weight loss, 
or no weight change. Unfortunately, that is all an instrument can offer unless 
Technical Point 16.2 


Bounds: Partial identification of causal effects. For a dichotomous outcome Y, the average causal effect 
$\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]$ can take values between -1 (if all individuals develop the outcome unless they were 
treated) and 1 (if no individuals develop the outcome unless treated). The bounds of the average causal effect are 
(-1, 1). The distance between these bounds can be cut in half by using the data: because for each individual we know 
the value of either her counterfactual outcome $Y^{a=1}$ (if the individual was actually treated) or $Y^{a=0}$ (if the individual 
was actually untreated), we can compute the causal effect after assigning the most extreme values possible to each 
individual's unknown counterfactual outcome. This will result in bounds of the average causal effect that are narrower 
but still include the null value 0. For a continuous outcome Y, deriving bounds requires the specification of the minimum 
and maximum values for the outcome; the width of the bounds will vary depending on the chosen values. 
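
The bounds that use only the data can be sketched in a few lines. This is an illustration, not one of the book's programs: the function name and the input probabilities are hypothetical, and the calculation relies on consistency (each individual's observed outcome equals the counterfactual outcome under the treatment actually received).

```python
def no_assumption_bounds(p_y1_a1, p_y1_a0, p_a1):
    """Bounds for Pr[Y^{a=1}=1] - Pr[Y^{a=0}=1] from the observed data alone.

    p_y1_a1 : Pr[Y = 1, A = 1]
    p_y1_a0 : Pr[Y = 1, A = 0]
    p_a1    : Pr[A = 1]
    """
    p_a0 = 1.0 - p_a1
    # Lower bound: every untreated individual would have Y^{a=1} = 0 and
    # every treated individual would have Y^{a=0} = 1.
    lower = p_y1_a1 - (p_y1_a0 + p_a1)
    # Upper bound: every untreated individual would have Y^{a=1} = 1 and
    # every treated individual would have Y^{a=0} = 0.
    upper = (p_y1_a1 + p_a0) - p_y1_a0
    return lower, upper

# Hypothetical joint probabilities, for illustration only.
lower, upper = no_assumption_bounds(p_y1_a1=0.10, p_y1_a0=0.30, p_a1=0.25)
print(lower, upper)
```

Whatever the inputs, the width of these bounds is Pr[A = 1] + Pr[A = 0] = 1, half the width of (-1, 1), and the interval always contains the null value 0.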

The bounds for $\Pr[Y^{a=1} = 1] - \Pr[Y^{a=0} = 1]$ can be further narrowed when there exists a variable Z that 
meets instrumental condition (ii) at the population level and marginal exchangeability (iii) (Robins 1989; Manski 1990). 
The width of these so-called natural bounds, Pr[A = 1|Z = 0] + Pr[A = 0|Z = 1], is narrower than that of the 
bounds identified from the data alone. Sometimes narrower bounds—the so-called sharp bounds—can be achieved 
when marginal exchangeability is replaced by joint exchangeability (Balke and Pearl 1997; Richardson and Robins 2014). 

The conditions necessary to achieve the sharp bounds can also be derived from the SWIGs under joint interventions 
on z and a corresponding to any of the causal diagrams depicted in Figures 16.1, 16.2, and 16.3. Richardson and 
Robins (2010, 2014) showed that the conditions $Y^{z,a} \perp\!\!\!\perp (Z, A) \mid U$ and $Z \perp\!\!\!\perp U$, together with a population level condition 
(ii) within levels of U, i.e., $E[Y^{z,a} \mid U] = E[Y^{z',a} \mid U]$, are sufficient to obtain the sharp bounds. Specifically, these 
conditions imply $Z \perp\!\!\!\perp U$, $Y \perp\!\!\!\perp Z \mid U, A$, and that $E[Y^{z,a}]$ is given by the g-formula $\int E[Y \mid A = a, U = u]\, dF(u)$ ignoring 
Z, which reflects that Z has no direct effect on Y within levels of U. Dawid (2003) proved that these latter conditions 
lead to the sharp bounds. Under further assumptions, Richardson and Robins derived yet narrower bounds. See also 
Richardson, Evans, and Robins (2011). 

Unfortunately, all these partial identification methods (i.e., methods for bounding the effect) are often relatively 
uninformative because the bounds are wide. Swanson et al (2018) review partial identification methods for binary 
instruments, treatments, and outcomes. Swanson et al. (2015c) describe a real-world application of several partial 
identification methods and discuss their relative advantages and disadvantages. 

There is a way to decrease the width of the bounds: making parametric assumptions about the form of the effect 
of A on Y. Under sufficiently strong assumptions described in Section 16.2, the upper and lower bounds converge into 
a single number and the average causal effect is point identified. 


one is willing to make additional unverifiable assumptions. Sections 16.3 and 
16.4 review additional conditions under which the IV estimand is the average 
causal effect. Before that, we review the methods to do so. 


16.2 The usual IV estimand 


We will focus on dichotomous instruments, 
which are the commonest ones. 
For a continuous instrument 
Z, the usual IV estimand is 
$\frac{Cov(Y, Z)}{Cov(A, Z)}$, where Cov means covariance. 


When a dichotomous variable Z is an instrument, i.e., meets the three instrumental 
conditions (i)-(iii), and an additional condition (iv) described in the 
next section holds, then the average causal effect of treatment on the additive 
scale $E[Y^{a=1}] - E[Y^{a=0}]$ is identified and equals 

$$\frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[A \mid Z = 1] - E[A \mid Z = 0]},$$

which is the usual IV estimand for a dichotomous instrument. (Note E[A|Z = 1] = 
Pr [A = 1|Z = 1] for a dichotomous treatment). Technical Point 16.3 provides 



In randomized experiments, the IV 
estimator is the ratio of two effects 
of Z: the effect of Z on Y and the 
effect of Z on A. Each of these ef- 
fects can be consistently estimated 
without adjustment because Z is 
randomly assigned. 


Also known as the Wald estimator 
(Wald 1940). 


CODE: Program 16.1 

For simplicity, we exclude individu- 
als with missing outcome or instru- 
ment. In practice, we could use IP 
weighting to adjust for possible se- 
lection bias before using IV estima- 
tion. 


CODE: Program 16.2 
a proof of this result in terms of an additive structural mean model, but you 
might want to wait until the next section before reading it. 

To intuitively understand the usual IV estimand, consider again the ran- 
domized trial from the previous section. The numerator of the IV estimand— 
the average causal effect of Z on Y—is the intention-to-treat effect, and the 
denominator—the average causal effect of Z on A—is a measure of adherence 
to, or compliance with, the assigned treatment. When there is perfect compli- 
ance, the denominator is equal to 1, and the effect of A on Y equals the effect 
of Z on Y. As compliance worsens, the denominator starts to get closer to 
0, and the effect of A on Y becomes greater than the effect of Z on Y. The 
greater the rate of noncompliance, the greater the difference between the effect 
of A on Y—the IV estimand—and the effect of Z on Y. 
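
A purely hypothetical numerical illustration may help (these numbers are not from the smoking cessation data). Suppose the intention-to-treat effect in the numerator is 1 kg. If the denominator, the effect of Z on A, is 1 (perfect compliance), 0.5, or 0.1, the usual IV estimand is respectively

```latex
\frac{1}{1} = 1 \text{ kg}, \qquad
\frac{1}{0.5} = 2 \text{ kg}, \qquad
\frac{1}{0.1} = 10 \text{ kg}.
```

The same intention-to-treat effect is inflated tenfold when the Z-A difference is only 0.1.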

The IV estimand bypasses the need to adjust for the confounders by in- 
flating the intention-to-treat effect in the numerator. The magnitude of the 
inflation increases as compliance decreases, i.e., as the Z-A risk difference gets 
closer to zero. The same rationale applies to the instruments used in observa- 
tional studies, where the denominator of the IV estimator may equal either the 
causal effect of the causal instrument Z on A (Figure 16.1), or the noncausal 
association between the proxy instrument Z and the treatment A (Figures 16.2 
and 16.3). 

The standard IV estimator is the ratio of the estimates of the numerator 
and the denominator of the usual IV estimand. In our smoking cessation 
example with a dichotomous instrument Z (1: state with high cigarette price, 
0: otherwise), the numerator estimate $\hat{E}[Y \mid Z = 1] - \hat{E}[Y \mid Z = 0]$ equals 2.686 - 
2.536 = 0.1503 and the denominator estimate $\hat{E}[A \mid Z = 1] - \hat{E}[A \mid Z = 0]$ equals 0.2578 - 
0.1951 = 0.0627. Therefore, the usual IV estimate is the ratio 0.1503/0.0627 = 
2.4 kg. Under the three instrumental conditions (i)-(iii) plus condition (iv) 
from next section, this is an estimate of the average causal effect of smoking 
cessation on weight gain in the population. 
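
The ratio above is simple enough to sketch directly. The function below is an illustrative assumption (its name and interface are not from Program 16.1); it computes the standard IV estimate from individual-level arrays, and the last lines redo the arithmetic of the text from the reported, rounded sample averages.

```python
import numpy as np

def wald_iv_estimate(z, a, y):
    """Standard IV (Wald) estimate for a dichotomous instrument z (coded 0/1)."""
    z, a, y = (np.asarray(v, dtype=float) for v in (z, a, y))
    numerator = y[z == 1].mean() - y[z == 0].mean()      # Z-Y difference in means
    denominator = a[z == 1].mean() - a[z == 0].mean()    # Z-A difference in means
    return numerator / denominator

# Arithmetic from the text, using the rounded sample averages it reports.
iv_from_text = (2.686 - 2.536) / (0.2578 - 0.1951)
print(round(iv_from_text, 1))
```

With the rounded averages the ratio is approximately 2.39, which agrees with the 2.4 kg reported in the text.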

We estimated the numerator and denominator of the IV estimand by simply 
calculating the four sample averages $\hat{E}[A \mid Z = 1]$, $\hat{E}[A \mid Z = 0]$, $\hat{E}[Y \mid Z = 1]$, and 
$\hat{E}[Y \mid Z = 0]$. Equivalently, we could have fit two (saturated) linear models to 
estimate the differences in the denominator and the numerator. The model 
for the denominator would be $E[A \mid Z] = \alpha_0 + \alpha_1 Z$, and the model for the 
numerator $E[Y \mid Z] = \beta_0 + \beta_1 Z$. 

Linear models are used as an alternative method to calculate the standard 
IV estimator: the two-stage-least-squares estimator. The procedure is 
as follows. First, fit the first-stage treatment model $E[A \mid Z] = \alpha_0 + \alpha_1 Z$, 
and generate the predicted values $\hat{E}[A \mid Z]$ for each individual. Second, fit the 
second-stage outcome model $E[Y \mid Z] = \beta_0 + \beta_1 \hat{E}[A \mid Z]$. The parameter estimate 
$\hat{\beta}_1$ will always be numerically equivalent to the standard IV estimate. 
Thus, in our example, the two-stage-least-squares estimate was again 2.4 kg. 
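
The numerical equivalence between the two procedures can be checked on simulated data. The sketch below is an illustration under assumed values: the data-generating process, the sample size, and the variable names are hypothetical and unrelated to the smoking cessation data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                        # unmeasured confounder (hypothetical)
z = rng.integers(0, 2, size=n)                # dichotomous instrument
a = (0.5 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(float)   # treatment
y = 2.0 * a + u + rng.normal(size=n)          # outcome

# Standard IV estimate: ratio of two differences in sample means.
wald = ((y[z == 1].mean() - y[z == 0].mean())
        / (a[z == 1].mean() - a[z == 0].mean()))

# Two-stage least squares.
alpha1, alpha0 = np.polyfit(z, a, 1)          # first stage: E[A|Z] = alpha0 + alpha1 * Z
a_hat = alpha0 + alpha1 * z                   # predicted treatment for each individual
beta1, beta0 = np.polyfit(a_hat, y, 1)        # second stage: regress Y on predicted A

print(wald, beta1)
```

The two printed numbers agree up to floating-point error, because the second-stage slope equals the Z-Y difference in means divided by the first-stage slope. The naive second-stage standard error is not valid, which is one reason dedicated IV software is used in practice.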

The 2.4 point estimate has a very wide 95% confidence interval: -36.5 to 
41.3. This is expected for our proposed instrument because the Z-A association 
is weak and there is much uncertainty in the first-stage model. A commonly 
used rule of thumb is to declare an instrument as weak if the F-statistic from 
the first-stage model is less than 10 (it was a meager 0.8 in our example). We 
will revisit the problems raised by weak instruments in Section 16.5. 
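
As a rough sketch of the rule of thumb, the first-stage F-statistic for a single instrument can be computed from the first-stage linear model. The function name and the simulated instruments below are assumptions for illustration; they do not reproduce the value reported for the smoking cessation example.

```python
import numpy as np

def first_stage_f(z, a):
    """F-statistic for the slope in the first-stage model E[A|Z] = alpha0 + alpha1 * Z."""
    z = np.asarray(z, dtype=float)
    a = np.asarray(a, dtype=float)
    n = len(z)
    slope, intercept = np.polyfit(z, a, 1)
    fitted = intercept + slope * z
    ss_model = np.sum((fitted - a.mean()) ** 2)   # explained sum of squares, 1 df
    ss_resid = np.sum((a - fitted) ** 2)          # residual sum of squares, n - 2 df
    return ss_model / (ss_resid / (n - 2))

# Hypothetical data: one strong and one weak instrument for a dichotomous treatment.
rng = np.random.default_rng(1)
z = rng.integers(0, 2, size=2000)
a_strong = rng.binomial(1, 0.2 + 0.5 * z)     # Z-A risk difference of 0.50
a_weak = rng.binomial(1, 0.2 + 0.01 * z)      # Z-A risk difference of 0.01
f_strong = first_stage_f(z, a_strong)
f_weak = first_stage_f(z, a_weak)
print(f_strong, f_weak)
```

With a single instrument this F-statistic is the square of the t-statistic for the first-stage slope, so it grows with both the strength of the Z-A association and the sample size.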

The two-stage-least-squares estimator and its variations force investigators 
to make strong parametric assumptions. Some of these assumptions can 
be avoided by using additive or multiplicative structural mean models, like 
the ones described in Technical Points 16.3 and 16.4, for IV estimation. The 
parameters of structural mean models can be estimated via g-estimation. The 
trade-offs involved in the choice between two-stage-least-squares linear models 
and structural mean models are similar to those involved in the choice be- 
tween outcome regression and structural nested models for non-IV estimation 
(see Chapters 14 and 15). 

Anyway, the above estimators are only valid when the usual IV estimand 
can be interpreted as the average causal effect of treatment A on the outcome 
Y. For that to be true, a fourth identifying condition needs to hold in addition 
to the three instrumental conditions. 


16.3 A fourth identifying condition: homogeneity 


Yet additive rank preservation was 
implicitly assumed in many early IV 
analyses using the two-stage-least- 
squares estimator. 


Even when instrumental condition 
(iii) $Y^a \perp\!\!\!\perp Z$ holds (as in the 
SWIGs corresponding to Figures 
16.1, 16.2, and 16.3), $Y^a \perp\!\!\!\perp Z \mid A$ 
does not generally hold. Therefore 
the treatment effect may depend 
on instrument Z, i.e., the less extreme 
homogeneity condition may 
not hold. 


Also, Hernán and Robins (2006b) 
showed that, if U is an additive effect 
modifier, then it would not be 
reasonable for us to believe that the 
previous additive homogeneity condition 
(iv) holds. 


The three instrumental conditions (i)-(iii) are insufficient to ensure that the IV 
estimand is the average causal effect of treatment A on Y. A fourth condition, 
effect homogeneity (iv), is needed. Here we describe four possible homogeneity 
conditions (iv) in order of (historical) appearance. 

The most extreme, and oldest, version of homogeneity condition (iv) is con- 
stant effect of treatment A on outcome Y across individuals. In our example, 
this constant effect condition would hold if smoking cessation made every in- 
dividual in the population gain (or lose) the same amount of body weight, say, 
exactly 2.4 kg. A constant effect is equivalent to additive rank preservation 
which, as we discussed in Section 14.4, is scientifically implausible for most 
treatments and outcomes—and impossible for dichotomous outcomes, except 
under the sharp null or universal harm (or benefit). In our example, we expect 
that, after quitting smoking, some individuals will gain a lot of weight, some 
will gain little, and others may even lose some weight. Therefore, we are not 
generally willing to accept the homogeneity assumption of constant effect as a 
reasonable condition (iv). 

A second, less extreme homogeneity condition (iv) for dichotomous Z and 
A is equality of the average causal effect of A on Y across levels of Z in 
both the treated and in the untreated, i.e., 
$E[Y^{a=1} - Y^{a=0} \mid Z = 1, A = a] = E[Y^{a=1} - Y^{a=0} \mid Z = 0, A = a]$ for $a = 0, 1$. 
This additive homogeneity condition (iv) was the one used in the mathematical proof of Technical Point 16.3. 
An alternative homogeneity condition on the multiplicative scale is discussed 
in Technical Point 16.4. (This multiplicative homogeneity condition leads to 
an IV estimand that is different from the usual IV estimand.) 
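
To make the term concrete: for a dichotomous Z, the usual IV estimand is the ratio written out in Technical Point 16.3, that is, the Z-Y mean difference divided by the Z-A mean difference. The following is a minimal Python sketch of its sample analogue. The function name and the toy data are hypothetical illustrations, not taken from the book's programs.

```python
from statistics import mean


def usual_iv_estimate(z, a, y):
    """Sample analogue of (E[Y|Z=1] - E[Y|Z=0]) / (E[A|Z=1] - E[A|Z=0])."""
    y1 = mean(yi for zi, yi in zip(z, y) if zi == 1)
    y0 = mean(yi for zi, yi in zip(z, y) if zi == 0)
    a1 = mean(ai for zi, ai in zip(z, a) if zi == 1)
    a0 = mean(ai for zi, ai in zip(z, a) if zi == 0)
    denominator = a1 - a0
    if denominator == 0:
        # No Z-A association in the data: condition (i) fails empirically.
        raise ValueError("Z and A are not associated in these data")
    return (y1 - y0) / denominator


# Hypothetical toy data: instrument, treatment, outcome.
z = [1, 1, 1, 1, 0, 0, 0, 0]
a = [1, 1, 1, 0, 1, 0, 0, 0]
y = [5.0, 4.0, 6.0, 1.0, 4.0, 2.0, 1.0, 1.0]

estimate = usual_iv_estimate(z, a, y)  # (4.0 - 2.0) / (0.75 - 0.25)
```

Whether this number can be read as an average causal effect depends entirely on which condition (iv) is assumed; the code only computes the ratio.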

The above homogeneity condition is expressed in terms that are not natu- 
rally intuitive. How can subject-matter experts provide arguments in support 
of a constant average causal effect within levels of the proposed instrument 
Z and the treatment A in any particular study? More natural—even if still 
untestable—homogeneity conditions (iv) would be stated in terms of effect 
modification by possibly known (even if unmeasured) confounders U. One 
such condition is that U is not an additive effect modifier, i.e., that the 
average causal effect of A on Y is the same at every level of the unmeasured 
confounder U or $E[Y^{a=1} \mid U] - E[Y^{a=0} \mid U] = E[Y^{a=1}] - E[Y^{a=0}]$. This third 
homogeneity condition (iv) is often implausible because some unmeasured con- 
founders may also be effect modifiers. For example, the magnitude of weight 
gain after smoking cessation may vary with prior intensity of smoking, which 
may itself be an unmeasured confounder for the effect of smoking cessation on 
weight gain. 

Another type of homogeneity condition (iv) is that the Z-A association on 
Technical Point 16.3 


Additive structural mean models and IV estimation. Consider the following saturated, additive structural mean 
model for a dichotomous treatment A and an instrument Z as depicted in Figures 16.1, 16.2, or 16.3: 


$$E[Y^{a=1} - Y^{a=0} \mid A = 1, Z] = \beta_0 + \beta_1 Z$$


This model can also be written as $E[Y - Y^{a=0} \mid A, Z] = A(\beta_0 + \beta_1 Z)$. The parameter $\beta_0$ is the average causal effect 
of treatment among the treated individuals with Z = 0, and $\beta_0 + \beta_1$ is the average causal effect of treatment among 
the treated individuals with Z = 1. Thus $\beta_1$ quantifies additive effect modification by Z. 

If we a priori assume that there is no additive effect modification by Z, then $\beta_1 = 0$ and $\beta_0$ is exactly the usual IV 
estimand (Robins 1994). That is, the usual IV estimand is the parameter of an additive structural mean model for the 
effect of treatment on the treated under no effect modification by Z. 

The proof is simple. When Z is an instrument, condition (ii) holds, which implies $E[Y^{a=0} \mid Z = 1] = E[Y^{a=0} \mid Z = 0]$. 
Using the structural model notation, this conditional mean independence can be rewritten as 
$E[Y - A(\beta_0 + \beta_1) \mid Z = 1] = E[Y - A\beta_0 \mid Z = 0]$. Solving the above equation with $\beta_1 = 0$ we have 


$$\beta_0 = \frac{E[Y \mid Z = 1] - E[Y \mid Z = 0]}{E[A \mid Z = 1] - E[A \mid Z = 0]}$$

You may wonder why we a priori set $\beta_1 = 0$. The reason is that we have an equation with two unknowns ($\beta_0$ and 
$\beta_1$) and that equation exhausts the constraints on the data distribution implied by the three instrumental conditions. 
Since we need an additional constraint, which by definition will be untestable, we arbitrarily choose $\beta_1 = 0$ (rather than, 
say, $\beta_1 = 2$). This is what we mean when we say that an instrument is insufficient to identify the average causal effect. 

Hence, under the additional assumption $\beta_1 = 0$, $\beta_0 = E[Y^{a=1} - Y^{a=0} \mid A = 1, Z = z] = E[Y^{a=1} - Y^{a=0} \mid A = 1]$ for 
any z is the average causal effect of treatment in the treated. To conclude that $\beta_0$ is the average causal effect in 
the study population $E[Y^{a=1}] - E[Y^{a=0}]$, and thus that $E[Y^{a=1}] - E[Y^{a=0}]$ is the usual IV estimand, we must 
assume that the effects of treatment are identical in the treated and in the untreated, i.e., the parameter for Z is also 
0 in the structural model for A = 0. This is an additional untestable assumption. 


the additive scale is constant across levels of the confounders U, i.e., 
$E[A \mid Z = 1, U] - E[A \mid Z = 0, U] = E[A \mid Z = 1] - E[A \mid Z = 0]$. Unlike the previous three versions 
of homogeneity conditions, this one is not guaranteed to hold under the sharp 
causal null. On the other hand, this version has some testable implications: 
for dichotomous A, if some of the confounders are measured, then it must 
be the case that the difference is the same across levels of the measured 
confounders; for a continuous A, if we are willing to make additional assumptions 
about linearity, then the variance of the treatment A must be constant across 
levels of the instrument Z. Otherwise, the condition would not hold. 


Wang and Tchetgen-Tchetgen (2018) proposed these last two 
homogeneity conditions. 
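
The testable implication for a dichotomous A can be examined descriptively by computing the Z-A difference within each level of a measured confounder. The sketch below assumes a single measured, discrete confounder, here called `l` (a hypothetical name), and hypothetical toy data. In a real sample the stratum-specific differences will vary by chance, so a formal comparison would be needed before drawing conclusions.

```python
from collections import defaultdict


def za_difference_by_stratum(z, a, l):
    """Return {level of L: E[A|Z=1, L] - E[A|Z=0, L]} for dichotomous Z and A.

    Assumes every level of L contains individuals with Z = 1 and with Z = 0;
    otherwise the division below fails.
    """
    cells = defaultdict(lambda: {0: [], 1: []})
    for zi, ai, li in zip(z, a, l):
        cells[li][zi].append(ai)
    return {
        level: sum(arms[1]) / len(arms[1]) - sum(arms[0]) / len(arms[0])
        for level, arms in cells.items()
    }


# Hypothetical toy data with a measured confounder taking values 0 and 1.
z = [1, 1, 0, 0, 1, 1, 0, 0]
a = [1, 1, 1, 0, 1, 0, 0, 0]
l = [0, 0, 0, 0, 1, 1, 1, 1]

differences = za_difference_by_stratum(z, a, l)
```

In this toy example the difference is 0.5 in both strata, which is compatible with the condition. Unequal differences would contradict it; equal differences do not prove it, because the condition also involves the unmeasured confounders.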

Because of the perceived implausibility of the homogeneity conditions in 
many settings, the possibility that IV methods can validly estimate the average 
causal effect of treatment seems questionable. There are two approaches that 
bypass the homogeneity conditions. 

One approach is the introduction of baseline covariates in the models for IV 
estimation. To do so, it is safer to use structural mean models, which impose 
fewer parametric assumptions than two-stage-least-squares estimators. The 
inclusion of covariates in a structural mean model allows the treatment effect 
Technical Point 16.4 


Multiplicative structural mean models and IV estimation. Consider the following saturated, multiplicative 
(log-linear) structural mean model for a dichotomous treatment A 


$$\frac{E[Y^{a=1} \mid A = 1, Z]}{E[Y^{a=0} \mid A = 1, Z]} = \exp(\beta_0 + \beta_1 Z),$$


which can also be written as $E[Y \mid A, Z] = E[Y^{a=0} \mid A, Z] \exp[A(\beta_0 + \beta_1 Z)]$. For a dichotomous Y, $\exp(\beta_0)$ is the 
causal risk ratio in the treated individuals with Z = 0 and $\exp(\beta_0 + \beta_1)$ is the causal risk ratio in the treated with 
Z = 1. Thus $\beta_1$ quantifies multiplicative effect modification by Z. If we a priori assume that $\beta_1 = 0$ (and additionally 
assume no multiplicative effect modification by Z in the untreated), then the causal effect on the multiplicative (risk 
ratio) scale is $E[Y^{a=1}] / E[Y^{a=0}] = \exp(\beta_0)$, and the causal effect on the additive (risk difference) scale is 


$$E[Y^{a=1}] - E[Y^{a=0}] = E[Y \mid A = 0] (1 - E[A]) [\exp(\beta_0) - 1] + E[Y \mid A = 1] E[A] [1 - \exp(-\beta_0)]$$


The proof, which relies on the instrumental conditions, can be found in Robins (1989) and Hernán and Robins (2006b). 

That is, if we assume a multiplicative structural mean model with no multiplicative effect modification by Z in the 
treated and in the untreated, then the average causal effect $E[Y^{a=1}] - E[Y^{a=0}]$ remains identified, but no longer 
equals the usual IV estimator. As a consequence, our estimate of $E[Y^{a=1}] - E[Y^{a=0}]$ will depend on whether we 
assume no additive or multiplicative effect modification by Z. Unfortunately, it is not possible to determine which, if 
either, assumption is true even if we had an infinite sample size (Robins 1994) because, when considering saturated 
additive or multiplicative structural mean models, we have more unknown parameters to estimate than equations to 
estimate them with. That is precisely why we need to make modelling assumptions such as homogeneity. 
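
As a numerical illustration of the risk difference expression in this Technical Point, the sketch below evaluates it for hypothetical inputs. The function name and the input values are assumptions made for the example; in practice the parameter of the multiplicative model would have to be estimated (for instance via g-estimation), which this sketch does not attempt.

```python
from math import exp, log


def risk_difference_under_multiplicative_smm(beta0, mean_y_untreated,
                                             mean_y_treated, prob_treated):
    """E[Y^{a=1}] - E[Y^{a=0}] when exp(beta0) is the causal risk ratio
    in both the treated and the untreated.

    mean_y_untreated is E[Y|A=0], mean_y_treated is E[Y|A=1],
    prob_treated is E[A].
    """
    untreated_part = mean_y_untreated * (1 - prob_treated) * (exp(beta0) - 1)
    treated_part = mean_y_treated * prob_treated * (1 - exp(-beta0))
    return untreated_part + treated_part


# Hypothetical values: causal risk ratio of 2, E[Y|A=0] = 0.1,
# E[Y|A=1] = 0.3, and half of the population treated.
rd = risk_difference_under_multiplicative_smm(log(2), 0.1, 0.3, 0.5)

# The same inputs imply E[Y^{a=1}] = 0.25 and E[Y^{a=0}] = 0.125,
# so the implied risk ratio is 2, matching exp(beta0).
mean_y1 = 0.3 * 0.5 + 2 * 0.1 * 0.5
mean_y0 = 0.3 * 0.5 / 2 + 0.1 * 0.5
```

With these inputs the risk difference is 0.125, and it is 0 whenever the parameter is 0, as expected under the null.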


in the treated to vary with Z by imposing constraints on how the treatment 
effect varies within levels of the covariates. See Section 16.5 and Technical 
Point 16.5 for more details on structural mean models with covariates. 

Another approach is to use an alternative condition (iv) that does not 
require effect homogeneity. When combined with the three instrumental 
conditions (i)-(iii), this alternative condition allows us to endow the usual IV 
estimand with a causal interpretation, even though it does not suffice to 
identify the average causal effect in the population. We review this alternative 
condition (iv) in the next section. 


Also, models can be used to incorporate multiple proposed 
instruments simultaneously, to handle continuous treatments, and to 
estimate causal risk ratios when the outcome is dichotomous (see 
Palmer et al. 2011 for a review). 


16.4 An alternative fourth condition: monotonicity 


Consider again the double-blind randomized trial with randomization indicator 
Z, treatment A, and outcome Y. For each individual in the trial, the 
counterfactual variable $A^{z=1}$ is the value of treatment (1 or 0) that an individual 
would have taken if he had been assigned to receive treatment (z = 1). The 
counterfactual variable $A^{z=0}$ is analogously defined as the treatment value if 
the individual had been assigned to receive no treatment (z = 0). 


Figure 16.4. Always-takers: $A^z$ as a function of z. 


If we knew the values of the two counterfactual treatment variables $A^{z=1}$ 
and $A^{z=0}$ for each individual, we could classify all individuals in the study 
population into four disjoint subpopulations: 


1. Always-takers: Individuals who will always take treatment, regardless of 
the treatment group they were assigned to. That is, individuals with 
Technical Point 16.5 


More general structural mean models. Consider an additive structural mean model that allows for continuous and/or 
multivariate treatments A, instruments Z, and pre-instrument covariates V. Such model assumes 


$$E[Y - Y^{a=0} \mid Z, A, V] = \gamma(Z, A, V; \psi)$$


where $\gamma(Z, A, V; \psi)$ is a known function, $\psi$ is an unknown (possibly vector-valued) parameter, and $\gamma(Z, A = 0, V; \psi) = 0$. 
That is, an additive structural mean model is a model for the average causal effect of treatment level A compared 
with treatment level 0 among the subset of individuals at level Z of the instrument and level V of the confounders 
whose observed treatment is precisely A. The parameters of this model can be identified via g-estimation under the 
conditional counterfactual mean independence assumption $E[Y^{a=0} \mid Z = 1, V] = E[Y^{a=0} \mid Z = 0, V]$. 

Analogously, a general multiplicative structural mean model assumes 


$$E[Y \mid Z, A, V] = E[Y^{a=0} \mid Z, A, V] \exp[\gamma(Z, A, V; \psi)]$$


where $\gamma(Z, A, V; \psi)$ is a known function, $\psi$ is an unknown parameter vector, and $\gamma(Z, A = 0, V; \psi) = 0$. The 
parameters of this model can also be identified via g-estimation under analogous conditions. Identification conditions 
and efficient estimators for structural mean models were discussed by Robins (1994) and reviewed by Vansteelandt and 
Goetghebeur (2003). More generally, g-estimation of nested additive and multiplicative structural mean models can 
extend IV methods for time-fixed treatments and confounders to settings with time-varying treatments and confounders. 





both $A^{z=1} = 1$ and $A^{z=0} = 1$. 

2. Never-takers: Individuals who will never take treatment, regardless of 
the treatment group they were assigned to. That is, individuals with 
both $A^{z=1} = 0$ and $A^{z=0} = 0$. 

3. Compliers or cooperative: Individuals who will take treatment when 
assigned to treatment, and no treatment when assigned to no treatment. 
That is, individuals with $A^{z=1} = 1$ and $A^{z=0} = 0$. 

4. Defiers or contrarians: Individuals who will take no treatment when 
assigned to treatment, and treatment when assigned to no treatment. 
That is, individuals with $A^{z=1} = 0$ and $A^{z=0} = 1$. 


Figure 16.5. Never-takers: $A^z$ as a function of z. 

Figure 16.6. Compliers: $A^z$ as a function of z. 


Note that these subpopulations (often referred to as compliance types or 
principal strata) are not generally identified. If we observe that an individual was 
assigned to Z = 1 and took treatment A = 1, we do not know whether she is 
a complier or an always-taker. If we observe that an individual was assigned 
to Z = 1 and took treatment A = 0, we do not know whether he is a defier or 
a never-taker. 

When no defiers exist, we say that there is monotonicity because the 
instrument Z either does not change treatment A (as shown in Figure 16.4 for 
always-takers and Figure 16.5 for never-takers) or increases the value of 
treatment A (as shown in Figure 16.6 for compliers). For defiers, the instrument 
Z would decrease the value of treatment A, as shown in Figure 16.7. More 
generally, monotonicity holds when $A^{z=1} \geq A^{z=0}$ for all individuals. 
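
The classification into compliance types, the reason it is not identified from observed data, and the monotonicity condition can be summarized in a short Python sketch. All function and variable names are hypothetical. The sketch takes the pair of counterfactual treatments as known, which is never the case in practice.

```python
# Keys are (A^{z=0}, A^{z=1}), the two counterfactual treatment values.
COMPLIANCE_TYPES = {
    (1, 1): "always-taker",
    (0, 0): "never-taker",
    (0, 1): "complier",
    (1, 0): "defier",
}


def compliance_type(a_z0, a_z1):
    """Label an individual given both counterfactual treatments."""
    return COMPLIANCE_TYPES[(a_z0, a_z1)]


def compatible_types(z, a):
    """Types that could have produced observed treatment a under assignment z.

    Only one of the two counterfactual treatments is ever observed,
    so two types always remain possible.
    """
    return sorted(
        name
        for (a_z0, a_z1), name in COMPLIANCE_TYPES.items()
        if (a_z1 if z == 1 else a_z0) == a
    )


def monotonicity_holds(counterfactual_pairs):
    """True when A^{z=1} >= A^{z=0} for every individual, i.e., no defiers."""
    return all(a_z1 >= a_z0 for a_z0, a_z1 in counterfactual_pairs)


assigned_and_treated = compatible_types(z=1, a=1)
assigned_and_untreated = compatible_types(z=1, a=0)
```

Here `assigned_and_treated` contains the always-taker and complier labels, and `assigned_and_untreated` contains the defier and never-taker labels, matching the two cases described in the text.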

Figure 16.7. Defiers: $A^z$ as a function of z. 


Now let us replace any of the homogeneity conditions from the last section 
by the monotonicity condition, which will become our new condition (iv). Then 
the usual IV estimand does not equal the average causal effect of treatment 
$E[Y^{a=1}] - E[Y^{a=0}]$ any more. Rather, under monotonicity (iv), the usual IV 
The “compliers average causal effect” (CACE) is an example of 
a local average treatment effect (LATE) in a subpopulation, as 
opposed to the global average causal effect in the entire population. 
Greenland (2000) refers to compliers as cooperative, and to defiers as 
non-cooperative, to prevent confusion with the common concept of 
(observed) compliance in randomized trials. 


Deaton (2010) on the effect in the 
compliers: "This goes beyond the 
old story of looking for an object 
where the light is strong enough to 
see; rather, we have control over 
the light, but choose to let it fall 
where it may and then proclaim 
that whatever it illuminates is what 
we were looking for all along." 


A mitigating factor is that, un- 
der strong assumptions, investiga- 
tors can characterize the compliers 
in terms of their distribution of the 
observed variables (Angrist and Pis- 
chke 2009, Baiocchi et al 2014). 
estimand equals the average causal effect of treatment in the compliers, that 
is 


$$E[Y^{a=1} - Y^{a=0} \mid A^{z=1} = 1, A^{z=0} = 0].$$


Technical Point 16.6 shows a proof for this equality under the assumption that 
Z was effectively randomly assigned. As a sketch of the proof, the equality 
between the usual IV estimand and the effect in the compliers holds because 
the effect of assignment Z on Y—the numerator of the IV estimand—is a 
weighted average of the effect of Z in each of the four principal strata. However, 
the effect of Z on Y is exactly zero in always-takers and never-takers because 
the effect of Z is entirely mediated through A and the value of A in those 
subpopulations is fixed, regardless of the value of Z they are assigned to. Also, 
no defiers exist under monotonicity (iv). Therefore the numerator of the IV 
estimand is the effect of Z on Y in the compliers—which is the same as the 
effect of A on Y in the compliers—times the proportion of compliers in the 
population, which is precisely the denominator of the usual IV estimand. 
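
The sketch of the proof can be checked with simple arithmetic. The Python example below uses hypothetical strata proportions and hypothetical stratum-specific average effects of A on Y. It assumes a randomized Z, no direct effect of Z on Y, and no defiers.

```python
# Hypothetical principal strata under monotonicity (no defiers).
# "share" is the stratum's proportion of the population and "effect_of_a"
# is its average causal effect of A on Y.
strata = {
    "always-taker": {"share": 0.25, "a_z0": 1, "a_z1": 1, "effect_of_a": 1.0},
    "never-taker": {"share": 0.50, "a_z0": 0, "a_z1": 0, "effect_of_a": 5.0},
    "complier": {"share": 0.25, "a_z0": 0, "a_z1": 1, "effect_of_a": 2.0},
}

# Effect of Z on Y: within a stratum, Z matters only through the change
# it causes in A, so always-takers and never-takers contribute zero.
numerator = sum(
    s["share"] * s["effect_of_a"] * (s["a_z1"] - s["a_z0"])
    for s in strata.values()
)

# Effect of Z on A: equals the proportion of compliers.
denominator = sum(s["share"] * (s["a_z1"] - s["a_z0"]) for s in strata.values())

iv_estimand = numerator / denominator
population_effect = sum(s["share"] * s["effect_of_a"] for s in strata.values())
```

With these numbers the IV estimand is 2.0, the effect in the compliers, whereas the average causal effect in the whole population is 3.25.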

In observational studies, the usual IV estimand can also be used to estimate 
the effect in the compliers in the absence of defiers. Technically, there are no 
compliers or defiers in observational studies because the proposed instrument Z 
is not treatment assignment, but the term compliers refers to individuals with 
($A^{z=1} = 1$, $A^{z=0} = 0$) and the term defiers to those with ($A^{z=1} = 0$, $A^{z=0} = 1$). 
In our smoking cessation example, the compliers are the individuals who would 
quit smoking in a state with high cigarette price and who would not quit 
smoking in a state with low price. Conversely, the defiers are the individuals 
who would not quit smoking in a state with high cigarette price and who 
would quit smoking in a state with low price. If no defiers exist and the causal 
instrument is dichotomous (see below and Technical Point 16.6), then 2.4 kg 
is the IV effect estimate in the compliers. 

The replacement of homogeneity by monotonicity was welcomed in the 
mid-1990s as the salvation of IV methods. While homogeneity is often an 
implausible condition (iv), monotonicity appeared credible in many settings. 
IV methods under monotonicity (iv) cannot identify the average causal effect in 
the population, only in the subpopulation of compliers, but that seemed a price 
worth paying in order to keep powerful IV methods in our toolbox. However, 
the estimation of the average causal effect of treatment in the compliers under 
monotonicity (iv) has been criticized on several grounds. 

First, the relevance of the effect in the compliers is questionable. The 
subpopulation of compliers is not identified and, even though the proportion of 
compliers in the population can be calculated (it is the denominator of the usual 
IV estimand, see Technical Point 16.6), it varies from instrument to instrument 
and from study to study. Therefore, causal inferences about the effect in the 
compliers are difficult to use by decision makers. Should they prioritize the 
administration of treatment A = 1 to the entire population because treatment 
has been estimated to be beneficial among the compliers, which happen to be 
6% of the population in our example but could be a smaller or larger group 
in the real world? What if treatment is not as beneficial in always-takers 
and never-takers, the majority of the population? Unfortunately, the decision 
maker cannot know who is included in the 6%. Rather than arguing that the 
effect of the compliers is of primary interest, it may be more honest to accept 
that interest in this estimand is not the result of its practical relevance, but 
rather of the (often erroneous) perception that it is easy to identify. 

Second, monotonicity is not always a reasonable assumption in observa- 
tional studies. The absence of defiers seems a safe assumption in randomized 




The example to the right was proposed by Swanson and Hernán 
(2014). Also Swanson et al (2015a) showed empirically the 
existence of defiers in an observational setting. 


Definition of monotonicity for a continuous causal instrument $U_Z$: 
$A^{u_z}$ is a non-decreasing function of $u_z$ on the support of $U_Z$ 
(Angrist and Imbens 1995, Heckman and Vytlacil 1999). 


Swanson et al (2015) discuss the difficulties in defining monotonicity, 
and introduce the concepts of global and local monotonicity in 
observational studies. 


Sommer and Zeger (1991), Imbens 
and Rubin (1997), and Greenland 
(2000) describe examples of full 
compliance in the control group. 
trials: we do not expect that some individuals will provide consent for partici- 
pation in a trial with the perverse intention to do exactly the opposite of what 
they are asked to do. Further, monotonicity is ensured by design in trials in 
which those assigned to no treatment are prevented from receiving treatment, 
i.e., there are no always-takers or defiers. However, monotonicity is harder to 
justify for some instruments proposed in observational studies. Consider the 
proposed instrument “physician preference” to estimate the treatment effect 
in patients attending a clinic where two physicians with different preferences 
work. The first physician usually prefers to prescribe the treatment, but she 
makes exceptions for her patients with diabetes (because of some known con- 
traindications). The second usually prefers to not prescribe the treatment, but 
he makes exceptions for his more physically active patients (because of some 
perceived benefits). Any patient who was both physically active and diabetic 
would have been treated contrary to both of these physicians’ preferences, and 
therefore would be labeled as a defier. That is, monotonicity is unlikely to 
hold when the decision to treat is the result of weighing multiple criteria or 
dimensions of encouragement that include both risks and benefits. In these 
settings, the proportion of defiers may not be negligible. 
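
The physician preference example can be written out explicitly. The sketch below is a deliberately deterministic simplification: it codes the first physician as Z = 1 and the second as Z = 0, and it treats each physician's "usual" preference and stated exception as fixed rules. Both choices are assumptions made for illustration, as are the function names.

```python
def treatment_if_seen_by(physician, diabetic, active):
    """Treatment (1 or 0) a patient would receive from a given physician."""
    if physician == 1:
        # First physician: prefers to treat, except patients with diabetes.
        return 0 if diabetic else 1
    # Second physician: prefers not to treat, except physically active patients.
    return 1 if active else 0


def preference_type(diabetic, active):
    """Compliance type implied by the two counterfactual treatments."""
    a_z1 = treatment_if_seen_by(1, diabetic, active)
    a_z0 = treatment_if_seen_by(0, diabetic, active)
    if a_z1 == a_z0:
        return "always-taker" if a_z1 == 1 else "never-taker"
    return "complier" if a_z1 > a_z0 else "defier"


# Keys are (diabetic, active).
types = {
    (diabetic, active): preference_type(diabetic, active)
    for diabetic in (False, True)
    for active in (False, True)
}
```

Under these rules, a patient who is both diabetic and physically active is a defier, and all four compliance types occur in the same clinic.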


The situation is even more complicated for the proxy instruments Z rep- 
resented by Figures 16.2 and 16.3. If the causal instrument Uz is continuous 
(e.g., the true, unmeasured physician’s preference), then the standard IV es- 
timand using a dichotomous proxy instrument Z (e.g., some measured surro- 
gate of preference) is not the effect in a particular subpopulation of compliers. 
Rather, the standard IV estimand identifies a particular weighted average of 
the effect in all individuals in the population, which makes it difficult to in- 
terpret. Therefore the interpretation of the IV estimand as the effect in the 
compliers is questionable when the proposed dichotomous instrument is not 
causal, even if monotonicity held for the continuous causal instrument Uz (see 
Technical Point 16.6 for details). 


Last, but definitely not least important, the partitioning of the popula- 
tion into four subpopulations or principal strata may not be justifiable. In 
many realistic settings, the subpopulation of compliers is an ill-defined sub- 
set of the population. For example, using the proposed instrument “physician 
preference” in settings with multiple physicians, all physicians with the same 
preference level who could have seen a patient would have to treat the patient 
in the exact same way. This is not only an unrealistic assumption, but also 
essentially impossible to define in many observational studies in which it is un- 
known which physicians could have seen a patient. A stable partitioning into 
compliers, defiers, always takers and never takers also requires deterministic 
counterfactuals (not generally required to estimate average causal effects), no 
interference (e.g., I may be an always-taker, but decide not to take treatment 
when my friend doesn’t), absence of multiple versions of treatment and other 
forms of heterogeneity (a complier in one setting, or for a particular instrument, 
may not be a complier in another setting). 


In summary, if the effect in the compliers is considered to be of interest, 
relying on monotonicity (iv) seems a promising approach in double-blind ran- 
domized trials with two arms and all-or-nothing compliance, especially when 
one of the arms will exhibit full adherence by design. However, caution is 
needed when using this approach in more complex settings and observational 
studies, even if the proposed instrument were really an instrument. 
Fine Point 16.2 
Defining weak instruments. There are two related, but different, definitions of weak instrument in the literature: 


1. An instrument is weak if the true value of the Z-A association—the denominator of the IV estimand—is “small.” 


2. An instrument is weak if the F-statistic associated to the observed Z-A association is “small,” typically meaning 
less than 10. 


In our smoking cessation example, the proposed instrument met both definitions: the risk difference was only 6% and 
the F-statistic was a meager 0.8. 

The first definition, based on the true value of the Z-A association, reminds us that, even if we had an infinite 
sample, the IV estimator greatly amplifies any biases in the numerator when using a proposed weak instrument (the 
second problem of weak instruments in the main text). The second definition, based on the statistical properties of 
the Z-A association, reminds us that, even if we had a perfect instrument Z, the IV estimator can be biased in finite 
samples (the third problem of weak instruments in the main text). 





16.5 The three instrumental conditions revisited 


In the context of linear models, 
Martens et al (2006) showed that 
instruments are guaranteed to be 
weak in the presence of strong con- 
founding, because a strong A-U as- 
sociation leaves little residual vari- 
ation for a strong A-Uz, or A-Z, 
association. 


Bound, Jaeger and Baker (1995) 
documented this bias. Their pa- 
per was followed by many others 
that investigated the shortcomings 
of weak instruments. 


The previous sections have discussed the relative advantages and disadvantages 
of choosing monotonicity or homogeneity as the condition (iv). Our discussion 
implicitly assumed that the proposed instrument Z was in fact an instrument. 
However, in observational studies, the proposed instrument Z will fail to be a 
valid instrument if it violates either of the instrumental conditions (ii) or (iii), 
and will be a weak instrument if it only barely meets condition (i). In all these 
cases, the use of IV estimation may result in substantial bias even if condition 
(iv) held perfectly. We now discuss each of the three instrumental conditions. 

Condition (i), a Z-A association, is empirically verifiable. Before declaring 
Z as their proposed instrument, investigators will check that Z is associated 
with treatment A. However, when the Z-A association is weak as in our 
smoking cessation example, the instrument is said to be weak (see Fine Point 
16.2). Three serious problems arise when the proposed instrument is weak. 

First, weak instruments yield effect estimates with wide 95% confidence 
intervals, as in our smoking cessation example in Section 16.2. Second, weak 
instruments amplify bias due to violations of conditions (ii) and (iii). A pro- 
posed instrument Z which is weakly associated with treatment A yields a 
small denominator of the IV estimator. Therefore, violations of conditions (ii) 
and (iii) that affect the numerator of the IV estimator (e.g., unmeasured con- 
founding for the instrument, a direct effect of the instrument) will be greatly 
exaggerated. In our example, any bias affecting the numerator of the IV esti- 
mator would be multiplied by approximately 15.9 (1/0.0627). Third, even in 
large samples, weak instruments introduce bias in the standard IV estimator 
and result in underestimation of its variance. That is, the effect estimate is 
in the wrong place and the width of the confidence interval around it is too 
narrow. 
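
The bias amplification described as the second problem is a matter of arithmetic: any bias in the numerator is divided by the Z-A association. A minimal Python sketch follows, using the denominator 0.0627 from the text. The numerator bias of 0.1 kg and the stronger instrument with denominator 0.5 are hypothetical values chosen only for comparison.

```python
def amplification_factor(denominator):
    """Factor by which any bias in the IV numerator is multiplied."""
    return 1 / denominator


def iv_bias(numerator_bias, denominator):
    """Bias of the IV estimate implied by a given bias in the numerator."""
    return numerator_bias * amplification_factor(denominator)


weak_factor = amplification_factor(0.0627)   # denominator reported in the text
weak_bias = iv_bias(0.1, 0.0627)             # hypothetical 0.1 kg numerator bias
strong_bias = iv_bias(0.1, 0.5)              # hypothetical stronger instrument
```

The same 0.1 kg bias in the numerator becomes about 1.6 kg with the weak instrument but only 0.2 kg with the hypothetical stronger one.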

To understand the nature of this third problem, consider a randomly gen- 
erated dichotomous variable Z. In an infinite population, the denominator 
of the IV estimand will be exactly zero—there is a zero association between 
treatment A and a completely random variable—and the IV estimate will be 
undefined. However, in a study with a finite sample, chance will lead to an as- 
sociation between the randomly generated Z and the unmeasured confounders 




CODE: Program 16.4 


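Figure 16.8. Causal diagram with nodes Z, A, Y, and U, including a direct arrow from Z to Y. 


Figure 16.9. Causal diagram with nodes Z, A, A*, and Y. 


Figure 16.10. Causal diagram with confounders U1 and U2. 


CODE: Program 16.5 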


U—and therefore between Z and treatment A—that is weak but not exactly 
zero. If we propose this random Z as an instrument, the denominator of the 
IV estimator will be very small rather than zero. As a result the numerator 
will be incorrectly inflated, which will yield potentially very large bias. In fact, 
our proposed instrument “Price higher than $1.50” behaves like a randomly 
generated variable. Had we decided to define Z as price higher than $1.60, 
$1.70, $1.80, or $1.90, the IV estimate would have been 41.3, -40.9, -21.1, or 
-12.8 kg, respectively. In each case, the 95% confidence interval around the 
estimate was huge, though still an underestimate of the true uncertainty. Given 
how much bias and variability weak instruments may create, a strong proposed 
instrument that slightly violates conditions (ii) and (iii) may be preferable to 
a less invalid, but weaker, proposed instrument. 

Condition (ii), the absence of a direct effect of the instrument on the out- 
come, cannot be verified from the data. A deviation from condition (ii) can 
be represented by a direct arrow from the instrument Z to the outcome Y, as 
shown in Figure 16.8. This direct effect of the instrument that is not mediated 
through treatment A will contribute to the numerator of the IV estimator, and 
it will be incorrectly inflated by the denominator as if it were part of the effect 
of treatment A. Condition (ii) may be violated when a continuous or multi- 
valued treatment A is replaced in the analysis by a coarser (e.g., dichotomized) 
version A*. Figure 16.9 shows that, even if condition (ii) holds for the original 
treatment A, it does not have to hold for its dichotomized version A*, because 
the path Z → A → Y represents a direct effect of the instrument Z that is 
not mediated through the treatment A* whose effect is being estimated in the 
IV analysis. In practice, many treatments are replaced by coarser versions 
for simplicity of interpretation. Coarsening of treatment is problematic for IV 
estimation, but not necessarily for the methods discussed in previous chapters. 

Condition (iii), no confounding for the effect of the instrument on the out- 
come, is also unverifiable. Figure 16.10 shows confounding due to common 
causes of the proposed instrument Z and outcome Y that are (U1) and are 
not (U2) shared with treatment A. In observational studies, the possibility of 
confounding for the proposed instrument always exists (same as for any other 
variable not under the investigator’s control). Confounding contributes to the 
numerator of the IV estimator and is incorrectly inflated by the denominator 
as if it were part of the effect of treatment A on the outcome Y. 

Sometimes condition (iii), and the other conditions too, can appear more 
plausible within levels of the measured covariates. Rather than making the 
unverifiable assumption that there is absolutely no confounding for the effect 
of Z on Y, we might feel more comfortable making the unverifiable assumption 
that there is no unmeasured confounding for the effect of Z on Y within levels of 
the measured pre-instrument covariates V. We could then apply IV estimation 
repeatedly in each stratum of V, and pool the IV effect estimates under the 
assumption that the effect in the population (under homogeneity) or in the 
compliers (under monotonicity) is constant within levels of V. Alternatively 
we could include the variables V as covariates in the two-stage modeling. In 
our example, this reduced the size of the effect estimate and increased its 95% 
confidence interval. 

Another frequent strategy to support condition (iii) is to check for bal- 
anced distributions of the measured confounders across levels of the proposed 
instrument Z. The idea is that, if the measured confounders are balanced, it 
may be more likely that the unmeasured ones are balanced too. However, this 
practice may offer a false sense of security: even small imbalances can lead 
to counterintuitively large biases because of the bias amplification discussed 
Swanson et al (2015b) describe this 
selection bias in detail. 


above. 

A violation of condition (iii) may occur even in the absence of confounding 
for the effect of Z on Y. The formal version of condition (iii) requires 
exchangeability between individuals with different levels of the proposed 
instrument. Such exchangeability may be violated because of either confounding 
(see above) or selection bias. A surprisingly common way in which selection 
bias may be introduced in IV analyses is the exclusion of individuals with 
certain values of treatment A. For example, if individuals in the population may 
receive treatment levels A = 0, A = 1, or A = 2, an IV analysis restricted to 
individuals with A = 1 or A = 2 may yield a non-null effect estimate even if 
the true causal effect is null. This exclusion does not introduce bias in non-IV 
analyses whose goal is to estimate the effect of treatment A = 1 versus A = 2. 

All the above problems related to conditions (i)-(iii) are exacerbated in IV 
analyses that simultaneously use multiple proposed instruments in an attempt 
to alleviate the weakness of a single proposed instrument. Unfortunately, the 
larger the number of proposed instruments, the more likely that some of them 
will violate one of the instrumental conditions. 


16.6 Instrumental variable estimation versus other methods 


IV estimation is not the only method that requires modeling 
for identification of causal effects. Other econometric approaches 
like regression discontinuity analysis (Thistlethwaite and 
Campbell 1960) do too. 


Baiocchi and Small (2014) review some approaches to quantify how 
sensitive IV estimates are to violations of key assumptions. 


IV estimation differs from all previously discussed methods in at least three 
aspects. 

First, IV estimation requires modeling assumptions even if infinite data 
were available. This is not the case for previous methods like IP weighting 
or standardization: If we had treatment, outcome, and confounder data from 
all individuals in the super-population, we would simply calculate the average 
treatment effect as we did in Part I of this book, nonparametrically. In contrast, 
even if we had data on instrument, treatment, and outcome from all individuals 
in the super-population, IV estimation effectively requires the use of modeling 
assumptions in order to identify the average causal effect in the population. 
The homogeneity condition (iv) is mathematically equivalent to setting to zero 
the parameter corresponding to a product term in a structural mean model 
(see Technical Point 16.1). That is, IV estimation cannot be nonparametric— 
models are required for identification—which explains why the method was not 
discussed in Part I of this book. 

Second, relatively minor violations of conditions (i)-(iv) for IV estimation 
may result in large biases of unpredictable or counterintuitive direction. The 
foundation of IV estimation is that the denominator blows up the numerator. 
Therefore, when the conditions do not hold perfectly or the instrument is weak, 
there is potential for explosive bias in either direction. As a result, an IV es- 
timate may often be more biased than an unadjusted estimate. In contrast, 
previous methods tend to result in slightly biased estimates when their iden- 
tifiability conditions are only slightly violated, and adjustment is less likely to 
introduce a large bias. The exquisite sensitivity of IV estimates to departures
from their identifiability conditions makes the method especially dangerous for
novice users, and highlights the importance of sensitivity analyses. In addition,
it is often easier to use subject-matter knowledge to think about unmeasured 
confounders of the effect of A on Y and how they may bias our estimates than 
to think about unmeasured confounders of the effect of Z on Y and how they 
and the existence of defiers or effect heterogeneity may bias our estimates. 
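
The second point can be illustrated with a little arithmetic. The sketch below is illustrative only: the function name `wald_ratio` and every number in it are hypothetical and are not taken from the analyses in this chapter. It treats the usual IV estimand as a ratio and shows how a small bias in the numerator (say, from a slight violation of condition (ii) or (iii)) is divided by the Z-A association, so that the weaker the instrument, the larger the resulting bias.

```python
def wald_ratio(itt_effect_on_y, itt_effect_on_a):
    """Usual IV estimand for a dichotomous instrument:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[A|Z=1] - E[A|Z=0])."""
    return itt_effect_on_y / itt_effect_on_a

true_effect = 2.0      # hypothetical true effect of A on Y
numerator_bias = 0.1   # hypothetical small bias in the Z-Y association

biases = {}
for strength in (0.5, 0.1, 0.02):  # hypothetical E[A|Z=1] - E[A|Z=0]
    # With a valid instrument and homogeneity the numerator would be
    # true_effect * strength; the violation adds numerator_bias to it.
    numerator = true_effect * strength + numerator_bias
    estimate = wald_ratio(numerator, strength)
    biases[strength] = estimate - true_effect
    print(f"strength={strength:.2f} estimate={estimate:.2f} bias={biases[strength]:.2f}")
```

In this toy setting the bias of the ratio is simply the numerator bias divided by the instrument strength, which is why it grows without bound as the Z-A association approaches zero.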

Transparency requires proper reporting of IV analyses. See some
suggested guidelines by Brookhart et al (2010), Swanson and Hernán
(2013), and Baiocchi and Small (2014).

Third, the ideal setting for the applicability of standard IV estimation is
more restrictive than that for other methods. As discussed in this chapter,
standard IV estimation is better reserved for settings with lots of unmeasured 
confounding, a truly dichotomous and time-fixed treatment A, a strong and 
causal proposed instrument Z, and in which either effect homogeneity is ex- 
pected to hold, or one is genuinely interested in the effect in the compliers and 
monotonicity is expected to hold. A consequence of these restrictions is that 
IV estimation is generally used to answer relatively simple causal questions, 
such as the contrast A = 1 versus A = 0. For this reason, IV estimation will 
not be a prominent method in Part III of this book, which is devoted to time- 
varying treatments and the contrast of complex treatment strategies that are 
sustained over time. 

Causal inference relies on transparency of assumptions and on triangulation 
of results from methods that depend on different sets of assumptions. IV 
estimation is therefore an attractive approach because it depends on a different 
set of assumptions than other methods. However, because of the wide 95% 
confidence intervals typical of IV estimates, the value added by using this 
approach will often be small. Also, users of IV estimation need to be critically 
aware of the limitations of the method. While this statement obviously applies
to any causal inference method, the potentially counterintuitive direction and
magnitude of bias in IV estimation require special attention.







Technical Point 16.6 


Monotonicity and the effect in the compliers. Consider a dichotomous causal instrument Z, like the randomization 
indicator described in the text, and treatment A. Imbens and Angrist (1994) proved that the usual IV estimand equals 
the average causal effect in the compliers E[Y^{a=1} - Y^{a=0} | A^{z=1} - A^{z=0} = 1] under monotonicity (iv), i.e., when no
defiers exist. Baker and Lindeman (1994) had a related proof for a binary outcome. See also Angrist, Imbens, and 
Rubin (1996), and the associated discussion, and Baker, Kramer, and Lindeman (2016). A proof follows. 

The intention-to-treat effect can be written as the weighted average of the intention-to-treat effects in the four 
principal strata: 
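
Written out in the notation used in the rest of this Technical Point, the weighted average takes the following form. This display is a reconstruction from the surrounding definitions, with the four strata labeled as in the chapter:

```latex
\begin{aligned}
E[Y^{z=1} - Y^{z=0}] ={}& E[Y^{z=1} - Y^{z=0} \mid A^{z=1}=1, A^{z=0}=1]\,\Pr[A^{z=1}=1, A^{z=0}=1] && \text{(always-takers)} \\
{}+{}& E[Y^{z=1} - Y^{z=0} \mid A^{z=1}=0, A^{z=0}=0]\,\Pr[A^{z=1}=0, A^{z=0}=0] && \text{(never-takers)} \\
{}+{}& E[Y^{z=1} - Y^{z=0} \mid A^{z=1}=1, A^{z=0}=0]\,\Pr[A^{z=1}=1, A^{z=0}=0] && \text{(compliers)} \\
{}+{}& E[Y^{z=1} - Y^{z=0} \mid A^{z=1}=0, A^{z=0}=1]\,\Pr[A^{z=1}=0, A^{z=0}=1] && \text{(defiers)}
\end{aligned}
```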
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However, the intention-to-treat effect in both the always-takers and the never-takers is zero, because Z does not 
affect A in these two strata and, by individual-level condition (ii) of Technical Point 16.1, Z has no independent effect 
on Y. If we assume that no defiers exist, then the above sum is simplified to 


E[Y^{z=1} - Y^{z=0}] = E[Y^{z=1} - Y^{z=0} | A^{z=1} = 1, A^{z=0} = 0] Pr[A^{z=1} = 1, A^{z=0} = 0] (compliers).


But, in the compliers, the effect of Z on Y equals the effect of A on Y (because Z = A), that is
E[Y^{z=1} - Y^{z=0} | A^{z=1} = 1, A^{z=0} = 0] = E[Y^{a=1} - Y^{a=0} | A^{z=1} = 1, A^{z=0} = 0]. Therefore, the effect in the
compliers is

E[Y^{z=1} - Y^{z=0}] / Pr[A^{z=1} = 1, A^{z=0} = 0]


which is the usual IV estimand if we assume that Z is randomly assigned, as random assignment implies
Z ⫫ {Y^{z,a}, A^{z}; z = 0, 1; a = 0, 1}. Under this joint independence and consistency, the intention-to-treat
effect E[Y^{z=1} - Y^{z=0}] in the numerator equals E[Y|Z = 1] - E[Y|Z = 0], and the proportion of compliers
Pr[A^{z=1} = 1, A^{z=0} = 0] in the denominator equals Pr[A = 1|Z = 1] - Pr[A = 1|Z = 0]. To see why the latter
equality holds, note that the proportion of always-takers Pr[A^{z=0} = 1] = Pr[A = 1|Z = 0] and the proportion of
never-takers Pr[A^{z=1} = 0] = Pr[A = 0|Z = 1]. Since, under monotonicity (iv), there are no defiers, the proportion of
compliers Pr[A^{z=1} - A^{z=0} = 1] is the remainder 1 - Pr[A = 1|Z = 0] - Pr[A = 0|Z = 1] =
1 - Pr[A = 1|Z = 0] - (1 - Pr[A = 1|Z = 1]) = Pr[A = 1|Z = 1] - Pr[A = 1|Z = 0], which completes the proof.
The above proof only considers the setting depicted in Figure 16.1 in which the instrument Z is causal. When, 
as depicted in Figures 16.2 and 16.3, data on a surrogate instrument Z (but not on the causal instrument U_Z) are
available, Hernán and Robins (2006b) proved that the average causal effect in the compliers (defined according to U_Z)
is also identified by the usual IV estimator. Their proof depends critically on two assumptions: that Z is independent
of A and Y given the causal instrument U_Z, and that U_Z is binary. However, this independence assumption often has
little substantive plausibility unless U_Z is continuous. A corollary is that the interpretation of the IV estimand as the
effect in the compliers is questionable in many applications of IV methods to observational data in which Z is at best a
surrogate for U_Z.
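
The result for a causal, randomly assigned instrument can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the strata proportions and mean counterfactual outcomes are hypothetical numbers chosen for the example, there are no defiers, and Z has no independent effect on Y. It computes the usual IV estimand from the quantities that would be observed and compares it with the average effect in the compliers.

```python
# Hypothetical population: proportion p, treatment under z=0 and z=1,
# and mean counterfactual outcomes under a=0 and a=1, by principal stratum.
strata = {
    "always-takers": dict(p=0.2, a_z0=1, a_z1=1, y_a0=10.0, y_a1=12.0),
    "never-takers":  dict(p=0.5, a_z0=0, a_z1=0, y_a0=8.0,  y_a1=13.0),
    "compliers":     dict(p=0.3, a_z0=0, a_z1=1, y_a0=9.0,  y_a1=10.5),
}

def mean_y_given_z(z):
    """E[Y | Z=z] when Z is randomly assigned and affects Y only through A."""
    total = 0.0
    for s in strata.values():
        a = s["a_z1"] if z == 1 else s["a_z0"]
        total += s["p"] * (s["y_a1"] if a == 1 else s["y_a0"])
    return total

def pr_a1_given_z(z):
    """Pr[A = 1 | Z=z]."""
    return sum(s["p"] * (s["a_z1"] if z == 1 else s["a_z0"]) for s in strata.values())

itt = mean_y_given_z(1) - mean_y_given_z(0)               # numerator
prop_compliers = pr_a1_given_z(1) - pr_a1_given_z(0)      # denominator
iv_estimand = itt / prop_compliers
complier_effect = strata["compliers"]["y_a1"] - strata["compliers"]["y_a0"]
population_effect = sum(s["p"] * (s["y_a1"] - s["y_a0"]) for s in strata.values())
print(iv_estimand, complier_effect, population_effect)
```

In this example the IV estimand matches the effect in the compliers but not the average effect in the whole population, because the hypothetical effects differ across strata.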
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Chapter 17 
CAUSAL SURVIVAL ANALYSIS 


In previous chapters we have been concerned with causal questions about the treatment effects on outcomes 
occurring at a particular time point. For example, we have estimated the effect of smoking cessation on weight 
gain measured in the year 1982. Many causal questions, however, are concerned with treatment effects on the time 
until the occurrence of an event of interest. For example, we may want to estimate the causal effect of smoking 
cessation on the time until death, whenever death occurs. This is an example of a survival analysis. 

The use of the word “survival” does not imply that the event of interest must be death. The term “survival 
analysis”, or the equivalent term “failure time analysis”, is applied to any analyses about time to an event, where 
the event may be death, marriage, incarceration, cancer, flu infection, etc. Survival analyses require some special 
considerations and techniques because the failure time of many individuals may occur after the study has ended 
and is therefore unknown. This chapter outlines basic techniques for survival analysis in the simplified setting of 
time-fixed treatments. 


17.1 Hazards and risks 


Suppose we want to estimate the average causal effect of smoking cessation 
A (1: yes, 0: no) on the time to death T with time measured from the start 
of follow-up. This is an example of a survival analysis: the outcome is time 
to an event of interest that can occur at any time after the start of follow- 
up. In most follow-up studies, the event of interest is not observed to happen 
for all, or even the majority of, individuals in the study. This is so because 
most follow-up studies have a date after which there is no information on any 
individuals: the administrative end of follow-up. 
After the administrative end of follow-up, no additional data can be used. 
Individuals who do not develop the event of interest before the administrative 
end of follow-up have their survival time administratively censored, that is, we 
know that they survived beyond the administrative end of follow-up, but we 
do not know for how much longer. For example, let us say that we conduct 
the above survival analysis among the 1629 cigarette smokers from the NHEFS 
who were aged 25-74 years at baseline and who were alive through 1982. For 
all individuals, the start of follow-up is January 1, 1983 and the administrative 
end of follow-up is December 31, 1992. We define the administrative censoring 
time to be the difference between the date of administrative end of follow-up 
and date at which follow-up begins. In our example, this time is the same (120
months) for all individuals because the start of follow-up and the
administrative end of follow-up are the same for everybody. Of the 1629 individuals,
only 318 individuals died before the end of 1992, so the survival time of the
remaining 1311 individuals is administratively censored.

In a study with staggered entry (i.e., with a variable start of follow-up date)
different individuals will have different administrative censoring times, even
when the administrative end of follow-up date is common to all.

Administrative censoring is a problem intrinsic to survival analyses (studies
of smoking cessation and death will rarely, if ever, follow a cohort of individuals
until extinction), but administrative censoring is not the only type of censoring
that may occur in survival analyses. Like any other causal analyses, survival
analysis may also need to handle non-administrative types of censoring, such
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Fine Point 17.1 


Competing events. As described in Section 8.5, a competing event is an event (typically, death) that prevents the
event of interest (e.g., stroke) from happening: individuals who die from other causes (say, cancer) cannot ever develop
stroke. In survival analyses, the key decision is whether to consider competing events a form of non-administrative 
censoring. 


• If the competing event is considered a censoring event, then the analysis is effectively an attempt to simulate
a population in which death from other causes is somehow either abolished or rendered independent of the risk 
factors for stroke. The resulting effect estimate is hard to interpret and may not correspond to a meaningful 
estimand (see Chapter 8). In addition, the censoring may introduce selection bias under the null, which would 
require adjustment (by, say, IP weighting) using data on the measured risk factors for the event of interest. 


• If the competing event is not considered a censoring event, then the analysis effectively sets the time to event
to be infinite. That is, dead individuals are considered to have probability zero of developing stroke between 
their death and the administrative end of follow-up. The estimate of the effect of treatment on stroke is hard to 
interpret because a non-null estimate may arise from a direct effect of treatment on death, which would prevent 
the occurrence of stroke. 


An alternative to the handling of competing events is to create a composite event that includes both the competing 
event and the event of interest (e.g., death and stroke) and conduct a survival analysis for the composite event. 
This approach effectively eliminates the competing events, but fundamentally changes the causal question. Again, the 
resulting effect estimate is hard to interpret because a non-null estimate may arise from either an effect of treatment 
on stroke or on death. Another alternative is to restrict the inference to the principal stratum of individuals who would 
not die regardless of the treatment level they received. This approach targets a sort of local average effect, as defined 
in Chapter 16, which makes both interpretation and valid estimation especially challenging. 

None of the above strategies provides a satisfactory solution to the problem of competing events. Indeed the 
presence of competing events raises logical questions about the meaning of the causal estimand that cannot be bypassed 
by statistical techniques. For a detailed description of approaches to handle competing events and their challenges, see 
the discussion by Young et al. (2019). 





as loss to follow-up (e.g., dropout from the study) and competing events (see 
Fine Point 17.1). In previous chapters we have discussed how to adjust for the 
selection bias introduced by non-administrative censoring via standardization 
or IP weighting. The same approaches can be applied to survival analyses. 
Therefore, in this chapter, we will focus on administrative censoring. We defer 
a more detailed consideration of non-administrative censoring to Part III of the 
book because non-administrative censoring is generally a time-varying process, 
whereas the time of administrative censoring is fixed at baseline.

For simplicity, we assume that anyone without confirmed death survived the
follow-up period. In reality, some individuals may have died but confirmation
(by, say, a death certificate or a proxy interview) was not feasible. Also for
simplicity, we will ignore the problem described in Fine Point 12.1.

In our example, the month of death T can take values from 1
(January 1983) to 120 (December 1992). T is known for 102 treated (A = 1)
and 216 untreated (A = 0) individuals who died during the follow-up, and is
administratively censored (that is, all we know is that it is greater than 120
months) for the remaining 1311 individuals. Therefore we cannot compute the
mean survival E[T] as we did in previous chapters with the outcome of interest.
Rather, in survival analysis we need to use other measures that can
accommodate administrative censoring. Some common measures are the survival
probability, the risk, and the hazard. Let us define these quantities, which are
functions of the survival time T.

The survival probability Pr[T > k], or simply the survival at month k, is
the proportion of individuals who survived through time k. If we calculate the
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Other effect measures that can be 
derived from survival curves are 
years of life lost and the restricted 
mean survival time. 
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A common statistical test to com- 
pare survival curves (the log-rank 
test) yielded a P-value= 0.005. 
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survivals at each month until the administrative end of follow-up kena = 120 
and plot them along a horizontal time axis, we obtain the survival curve. 
The survival curve starts at Pr[T > 0] = 1 for k = 0 and then decreases
monotonically (that is, it does not increase) with subsequent values of k =
1, 2, ..., k_end. Alternatively, we can define risk, or cumulative incidence, at time
k as one minus the survival 1 - Pr[T > k] = Pr[T ≤ k]. The cumulative
incidence curve starts at Pr[T ≤ 0] = 0 and increases monotonically during
the follow-up.

In survival analyses, a natural approach to quantify the treatment effect is 
to compare the survival or risk under each treatment level at some or all times 
k. Of course, in our smoking cessation example, a contrast of these curves 
may not have a causal interpretation because the treated and the untreated 
are probably not exchangeable. However, suppose for a second (actually, until 
Section 17.4) that quitters (A = 1) and non-quitters (A = 0) are marginally 
exchangeable. Then we can construct the survival curves shown in Figure 17.1 
and compare Pr[T > k|A = 1] versus Pr[T > k|A = 0] for all times k. For
example, the survival at 120 months was 76.2% among quitters and 82.0% 
among non-quitters. Alternatively, we could contrast the risks rather than the 
survivals. For example, the 120-month risk was 23.8% among quitters and 
18.0% among non-quitters. 

At any time k, we can also calculate the proportion of individuals who 
develop the event among those who had not developed it before k. This is 
the hazard Pr[T = k | T > k - 1]. Technically, this is the discrete time hazard,
that is, the hazard in a study in which time is measured in discrete intervals— 
as opposed to measured continuously. Because in real-world studies, time is 
indeed measured in discrete intervals (years, months, weeks, days...) rather 
than in a truly continuous fashion, here we will refer to the discrete time 
hazard as, simply, the hazard. 

The risk and the hazard are different measures. The denominator of the 
risk—the number of individuals at baseline—is constant across times k and its 
numerator—all events between baseline and k—is cumulative. That is, the risk 
will stay flat or increase as k increases. On the other hand, the denominator 
of the hazard—the number of individuals alive at k—varies over time t and 
its numerator includes only recent events—those during interval k. That is, 
the hazard may increase or decrease over time. In our example, the hazard at 
120 months was 0% among quitters (because the last death happened at 113 
months in this group) and 1/986 = 0.10% among non-quitters, and the hazard 
curves between 0 and 120 months had roughly the shape of a letter M. 
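
The three measures can be computed side by side on a small dataset. The sketch below uses hypothetical toy data (ten individuals, a common administrative end of follow-up at month 5), not the NHEFS data, and the function names are ours. It assumes that administrative censoring at a common time is the only censoring, so simple proportions suffice.

```python
K_END = 5  # hypothetical administrative end of follow-up (months)

# Month of death for those who died; None means administratively censored
# (all we know is that T > K_END). Toy data, not the NHEFS.
death_month = [1, 3, 3, 5, None, None, None, None, None, None]
n = len(death_month)

def survival(k):
    """Pr[T > k]: proportion who survived through month k."""
    return sum(1 for t in death_month if t is None or t > k) / n

def risk(k):
    """Pr[T <= k] = 1 - Pr[T > k]: cumulative incidence at month k."""
    return 1 - survival(k)

def hazard(k):
    """Pr[T = k | T > k - 1]: deaths in month k among those alive at the end of k - 1."""
    at_risk = sum(1 for t in death_month if t is None or t > k - 1)
    events = sum(1 for t in death_month if t == k)
    return events / at_risk if at_risk else float("nan")

for k in range(1, K_END + 1):
    print(k, round(survival(k), 3), round(risk(k), 3), round(hazard(k), 3))
```

The risk never decreases with k because its denominator is fixed at the ten individuals at baseline, whereas the hazard goes up and down because both its numerator and its denominator change from month to month.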

A frequent approach to quantify the treatment effect in survival analyses 
is to estimate the ratio of the hazards in the treated and the untreated, known 
as the hazard ratio. However, the hazard ratio is problematic for the reasons 
described in Fine Point 17.2. Therefore, the survival analyses in this book 
privilege survival/risk over hazard. However, that does not mean that we 
should completely ignore hazards. The estimation of hazards is often a useful 
intermediate step for the estimation of survivals and risks. 


17.2 From hazards to risks 


In survival analyses, there are two main ways to arrange the analytic dataset. 
In the first data arrangement each row of the database corresponds to one 
person. This data format—often referred to as the long or wide format when 
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Figure 17.2 


By definition, everybody had to sur- 
vive month 0 in order to be included 
in the dataset, i.e., D_0 = 0 for all
individuals. 
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there are time-varying treatments and confounders—is the one we have used 
so far in this book. In the analyses of the previous section, the dataset had 
1629 rows, one per individual. 

In the second data arrangement each row of the database corresponds to 
a person-time. That is, the first row contains the information for person 1 at
k = 0, the second row the information for person 1 at k = 1, the third row
the information for person 1 at k = 2, and so on until the follow-up of person
1 ends. The next row contains the information of person 2 at k = 0, etc.
This person-time data format is the one we will use in most survival analyses
in this chapter and in all analyses with time-varying treatments in Part III. In
our smoking cessation example, the person-time dataset has 176,764 rows, one
per person-month.

To encode survival information through k in the person-time data format,
it is helpful to define a time-varying indicator of event D_k. For each person
at each month k, the indicator D_k takes value 1 if T ≤ k and value 0 if
T > k. The causal diagram in Figure 17.2 shows the treatment A and the
event indicators D_1 and D_2 at times 1 and 2, respectively. The variable U
represents the (generally unmeasured) factors that increase susceptibility to
the event of interest. Note that sometimes these susceptibility factors are
time-varying too. In that case, they can be depicted in the causal diagram as
U_0, U_1, ..., and so on. Part III deals with the case in which the treatment itself
is time-varying.

In the person-time data format, the row for a particular individual at time
k includes the indicator D_{k+1}. In our example, the first row of the person-time
dataset, for individual one at k = 0, includes the indicator D_1, which is 1 if the
individual died during month 1 and 0 otherwise; the second row, for individual
one at k = 1, includes the indicator D_2, which is 1 if the individual died during
month 2 and 0 otherwise; and so on. The last row in the dataset for each
individual is either her first row with D_{k+1} = 1 or the row corresponding to
month 119.
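
The conversion from one row per person to one row per person-month can be sketched as follows. The function `to_person_time`, the field name `d_next` (our label for the indicator of the event in the next month) and the two example individuals are hypothetical, introduced only for illustration.

```python
K_END = 120  # months of follow-up, as in the chapter's example

def to_person_time(person_id, a, death_month, k_end=K_END):
    """Expand one person into person-month rows.
    The row for time k carries the event indicator for month k + 1 ('d_next').
    death_month is None if the person survived past k_end."""
    rows = []
    for k in range(k_end):  # k = 0, ..., k_end - 1
        d_next = 1 if (death_month is not None and death_month == k + 1) else 0
        rows.append({"id": person_id, "a": a, "k": k, "d_next": d_next})
        if d_next == 1:  # last row: the first row with the event indicator equal to 1
            break
    return rows

# Hypothetical one-row-per-person data: (id, A, month of death or None)
wide = [(1, 1, 3), (2, 0, None)]
person_time = [row for pid, a, t in wide for row in to_person_time(pid, a, t)]
print(len(person_time), person_time[:3])
```

Here the person who dies in month 3 contributes three rows (k = 0, 1, 2) and the survivor contributes 120 rows (k = 0 through 119).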

Using the time-varying outcome variable D_k, we can define survival at k as
Pr[D_k = 0], which is equal to Pr[T > k], and risk at k as Pr[D_k = 1], which is
equal to Pr[T ≤ k]. The hazard at k is defined as Pr[D_k = 1|D_{k-1} = 0]. For
k = 1 the hazard is equal to the risk because everybody is, by definition, alive
at k = 0.

The survival probability at k is the product of the conditional probabilities
of having survived each interval between 0 and k. For example, the survival
at k = 2, Pr[D_2 = 0], is equal to survival probability at k = 1, Pr[D_1 = 0],
times the survival probability at k = 2 conditional on having survived through
k = 1, Pr[D_2 = 0|D_1 = 0]. More generally, the survival at k is


Pr[D_k = 0] = \prod_{m=1}^{k} Pr[D_m = 0|D_{m-1} = 0]


That is, the survival at k equals the product of one minus the hazard at all 
previous times. If we know the hazards through k we can easily compute the 
survival at k (or the risk at k, which is just one minus the survival). 
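
The product formula translates directly into a running product. In the sketch below the hazards are hypothetical values chosen for illustration, and `survival_from_hazards` is our own helper name.

```python
def survival_from_hazards(hazards):
    """hazards[m - 1] is the hazard at time m, for m = 1, ..., k.
    Returns the survival at times 1, ..., k as a running product of (1 - hazard)."""
    curve, surv = [], 1.0
    for h in hazards:
        surv *= 1.0 - h
        curve.append(surv)
    return curve

hazards = [0.10, 0.0, 2 / 9, 0.0, 1 / 7]  # hypothetical discrete-time hazards
surv = survival_from_hazards(hazards)
risk = [1.0 - s for s in surv]             # risk is one minus survival
print([round(s, 3) for s in surv])
print([round(r, 3) for r in risk])
```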

The hazard at k, Pr[D_k = 1|D_{k-1} = 0], can be estimated nonparametrically
by dividing the number of cases during the interval k by the number of
individuals alive at the end of interval k - 1. If we substitute this estimate
into the above formula the resulting nonparametric estimate of the survival
Pr[D_k = 0] at k is referred to as the Kaplan-Meier, or product-limit, estimator.
Figure 17.1 was constructed using the Kaplan-Meier estimator, which is
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Fine Point 17.2 


The hazards of hazard ratios. When using the hazard ratio as a measure of causal effect, two important properties 
of the hazard ratio need to be taken into account. 

First, because the hazards vary over time, the hazard ratio generally does too. That is, the ratio at time k may differ 
from that at time k + 1. However, many published survival analyses report a single hazard ratio, which is usually the 
consequence of fitting a Cox proportional hazards model that assumes a constant hazard ratio by ignoring interactions 
with time. The reported hazard ratio is a weighted average of the k-specific hazard ratios, which makes it hard to 
interpret. If the risk is rare and censoring only occurs at a common administrative censoring time k_end, then the weight
of the hazard ratio at time k is proportional to the total number of events among untreated individuals that occur at
k. (Technically, the weights are equal to the conditional density at k of T given A = 0 and T < k_end.) Because it is
a weighted average, the reported hazard ratio may be 1 even if the survival curves are not identical. In contrast to the 
hazard ratio, survival and risks are always presented as depending on time, e.g., the 5-year survival, the 120-month risk. 

Second, even if we presented the time-specific hazard ratios, their causal interpretation is not straightforward. 
Suppose treatment kills all high-risk individuals by time k and has no effects on others. Then the hazard ratio at time 
k + 1 compares the treated and the untreated individuals who survived through k. In the treated group, the survivors 
are all low-risk individuals (because the high-risk ones have already been killed by treatment); in the untreated group, 
the survivors are a mixture of high-risk and low-risk individuals (because treatment did not weed out the former). As a 
result the hazard ratio at k + 1 will be less than 1 even though treatment is not beneficial for any individual. 

This apparent paradox is an example of selection bias due to conditioning on a post-treatment variable (i.e., 
being alive at k) which is affected by treatment. For example, the hazard at time 2 is the probability
Pr[D_2 = 1|D_1 = 0, A] of the event at time 2 among those who survived time 1. As depicted in the causal diagram
of Figure 17.3, the conditioning on the collider D_1 will generally open the path A → D_1 ← U → D_2 and therefore
induce an association between treatment A and event D_2 among those with D_1 = 0. This built-in selection bias of
hazard ratios does not happen if the survival curves are the same in the treated and the untreated, that is, if there are 
no arrows from A into the indicators for the event. Hernán (2010) described an example of this problem. 
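
The second property can be reproduced with exact arithmetic in a stylized population. Everything in the sketch below is a hypothetical assumption made for illustration: half of the individuals are high-risk (U = 1), treatment is randomly assigned, treatment kills every high-risk individual in period 1, and treatment has no other effect on anyone at any time.

```python
p_high = 0.5                    # hypothetical Pr[U = 1], the same in both arms
base_hazard = {1: 0.5, 0: 0.1}  # hypothetical per-period hazard by U if untreated

def hazard_by_u(a, u, k):
    """Treatment kills all high-risk individuals in period 1 and does nothing else."""
    if a == 1 and u == 1 and k == 1:
        return 1.0
    return base_hazard[u]

def marginal_hazards(a, n_periods=2):
    """Hazard at each period in arm a, averaging over U among those still alive."""
    alive = {1: p_high, 0: 1 - p_high}  # Pr[U = u and alive] at the start
    out = []
    for k in range(1, n_periods + 1):
        deaths = sum(alive[u] * hazard_by_u(a, u, k) for u in alive)
        out.append(deaths / sum(alive.values()))
        alive = {u: alive[u] * (1 - hazard_by_u(a, u, k)) for u in alive}
    return out

hazard_ratios = [h1 / h0 for h1, h0 in zip(marginal_hazards(1), marginal_hazards(0))]
print([round(hr, 3) for hr in hazard_ratios])
```

In this setup the hazard ratio is above 1 in period 1 and below 1 in period 2, even though treatment benefits nobody: the treated survivors are all low-risk, while the untreated survivors remain a mixture.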





an excellent estimator of the survival curve, provided the total number of
failures over the follow-up period is reasonably large. Typically, the number of
cases during each interval is low (or even zero) and thus the nonparametric
estimates of the hazard Pr[D_k = 1|D_{k-1} = 0] at k will be very unstable. If
our interest is in estimation of the hazard at a particular k, smoothing via a
parametric model may be required (see Chapter 11 and Fine Point 17.3).
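
A minimal sketch of the Kaplan-Meier (product-limit) calculation in discrete time is given below. The toy data and the function name `kaplan_meier` are hypothetical. We also adopt the convention, stated here as an assumption, that a person censored at month t is known to be alive through month t and therefore counts as at risk during month t.

```python
def kaplan_meier(times, events, k_end):
    """times[i]: month of death, or month of censoring, for person i.
    events[i]: 1 if a death was observed at times[i], 0 if censored at times[i].
    Returns the estimated survival at k = 1, ..., k_end."""
    curve, surv = [], 1.0
    for k in range(1, k_end + 1):
        # At risk in month k: neither dead nor censored before month k.
        at_risk = sum(1 for t in times if t >= k)
        deaths = sum(1 for t, e in zip(times, events) if t == k and e == 1)
        if at_risk > 0:
            surv *= 1.0 - deaths / at_risk  # one minus the estimated hazard
        curve.append(surv)
    return curve

# Hypothetical toy data: one person is censored at month 2, five at month 5.
times = [1, 2, 3, 3, 5, 5, 5, 5, 5, 5]
events = [1, 0, 1, 1, 1, 0, 0, 0, 0, 0]
km = kaplan_meier(times, events, k_end=5)
print([round(s, 4) for s in km])
```

With only ten individuals the month-specific hazards jump between zero and large values, which is the instability that motivates parametric smoothing.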
An easy way to parametrically estimate the hazards is to fit a logistic
regression model for Pr[D_{k+1} = 1|D_k = 0] that, at each k, is restricted to
individuals who survived through k. The fit of this model is straightforward
when using the person-time data format. In our example, we can estimate the
hazards in the treated and the untreated by fitting the logistic model


logit Pr[D_{k+1} = 1|D_k = 0, A] = θ_{0,k} + θ_1 A + θ_2 A × k + θ_3 A × k^2


where θ_{0,k} is a time-varying intercept that can be estimated by some flexible
function of time such as θ_{0,k} = θ_0 + θ_4 k + θ_5 k^2. The flexible time-varying
intercept allows for a time-varying hazard and the product terms between
treatment A and time (θ_2 A × k + θ_3 A × k^2) allow the hazard ratio to vary over
time. See Technical Point 17.1 for details on how a logistic model approximates
a hazards model.


Figure 17.3


Functions other than the logit (e.g.,
the probit) can also be used to
model dichotomous outcomes and
therefore to estimate hazards.


CODE: Program 17.2

Although each person occurs in
multiple rows of the person-time
data structure, the standard error of
the parameter estimates outputted
by a routine logistic regression program
will be correct if the hazards
model is correct.

We then compute estimates of the survival Pr[D_{k+1} = 0|A = a] by multiplying,
over the intervals up to k, the estimates of one minus the hazard
Pr[D_{k+1} = 1|D_k = 0, A = a]
provided by the logistic model, separately for the treated and the untreated.
Figure 17.4 shows the survival curves obtained after parametric estimation of 
the hazards. These curves are a smooth version of those in Figure 17.1. 
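
The step from fitted hazards to survival curves can be sketched as follows. The coefficients below are hypothetical placeholders, not estimates from the NHEFS data; in a real analysis they would come from fitting the logistic model to the person-time dataset. The sketch only shows how the predicted hazards are turned into a survival curve for each treatment level.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients (NOT estimates): intercept, A, A*k, A*k^2, k, k^2
theta = {"0": -7.0, "1": 0.3, "2": -0.002, "3": 0.0, "4": 0.02, "5": -0.0001}

def hazard(a, k):
    """Predicted hazard for month k + 1 among those alive at k, for treatment level a."""
    lin = (theta["0"] + theta["4"] * k + theta["5"] * k ** 2
           + theta["1"] * a + theta["2"] * a * k + theta["3"] * a * k ** 2)
    return expit(lin)

def survival_curve(a, k_end=120):
    """Survival at months 1, ..., k_end: running product of (1 - hazard)."""
    curve, surv = [], 1.0
    for k in range(k_end):  # k = 0, ..., k_end - 1
        surv *= 1.0 - hazard(a, k)
        curve.append(surv)
    return curve

surv_treated = survival_curve(1)
surv_untreated = survival_curve(0)
print(round(surv_treated[-1], 3), round(surv_untreated[-1], 3))
```

Because the hazards come from a smooth function of k, the resulting curves are smooth even in months with no observed events.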
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Fine Point 17.3 


Models for survival analysis. Methods for survival analysis need to accommodate the expected censoring of failure 
times due to administrative end of follow-up. 

Nonparametric approaches to survival analysis, like constructing Kaplan-Meier curves, make no assumptions about 
the distribution of the unobserved failure times due to administrative censoring. On the other hand, parametric models 
for survival analysis assume a particular statistical distribution (e.g., exponential, Weibull) for the failure times or hazards. 
The logistic model described in the main text to estimate hazards is an example of a parametric model. 

Other models for survival analysis, like the Cox proportional hazards model and the accelerated failure time (AFT) 
model, do not assume a particular distribution for the failure times or hazards. In particular, these models are agnostic 
about the shape of the hazard when all covariates in the model have value zero—often referred to as the baseline hazard. 
These models, however, impose a priori restrictions on the relation between the baseline hazard and the hazard under 
other combinations of covariate values. As a result, these methods are referred to as semiparametric methods. 

See the book by Hosmer, Lemeshow, and May (2008) for a review of applied survival analysis. More formal 
descriptions can be found in the books by Fleming and Harrington (2005) and Kalbfleisch and Prentice (2002). 





The validity of this procedure requires no misspecification of the hazards 
model. In our example, this assumption seems plausible because we obtained 
essentially the same survival estimates as in the previous section when we 
estimated the survival in a fully nonparametric way. A 95% confidence interval 
around the survival estimates can be easily constructed via bootstrapping of 
the individuals in the population. 
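
A percentile bootstrap that resamples individuals, rather than person-months, can be sketched as below. For brevity, and as a simplifying assumption, the sketch recomputes a simple nonparametric survival proportion in each resample; in the parametric procedure described above, the hazards model would instead be refit to each bootstrap sample. The toy data and function names are hypothetical.

```python
import random

def survival_at(death_month, k):
    """Proportion with T > k; None means alive at the administrative end."""
    return sum(1 for t in death_month if t is None or t > k) / len(death_month)

def bootstrap_ci(death_month, k, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval obtained by resampling individuals with replacement."""
    rng = random.Random(seed)
    n = len(death_month)
    stats = sorted(
        survival_at([death_month[rng.randrange(n)] for _ in range(n)], k)
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy data: 20 deaths and 80 survivors at month 120.
toy = [10, 25, 40, 80, 110] * 4 + [None] * 80
point = survival_at(toy, 120)
lower, upper = bootstrap_ci(toy, 120)
print(point, lower, upper)
```

Resampling whole individuals keeps all the person-months of a person together, which respects the fact that rows from the same person are not independent.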


17.3 Why censoring matters 


The only source of censoring in our study is a common administrative censoring 
time k_end = 120 that is identical for all individuals. In this simple setting the
procedure described in the previous section to estimate the survival is overkill.
One can simply estimate the survival probabilities Pr[D_{k+1} = 0|A = a] by the
fraction of individuals who received treatment a and survived to k + 1, or
by fitting separate logistic models for Pr[D_{k+1} = 0|A] at each time, for k =
0, 1, ..., k_end.

Now suppose that individuals start the follow-up at different dates—there 
is staggered entry into the study—but the administrative end of follow-up date 
is common to all. Because the administrative censoring time is the difference 
between the administrative end of follow-up and the time of start of follow-up, 
different individuals will have different administrative censoring times. In this
setting it is helpful to define a time-varying indicator C_k for censoring by time
k. For each person at each month k, the indicator C_k takes value 0 if the
administrative end of follow-up is greater than k and takes value 1 otherwise.
In the person-time data format, the row for a particular individual at time k
includes the indicator C_{k+1}. We did not include this variable in our dataset
because C_{k+1} = 0 for all individuals at all times k before 120 months. In the
general case with random (i.e., individual-specific) administrative censoring,
the indicator C_{k+1} will transition from 0 to 1 at different times k for different
people.

Figure 17.4 (horizontal axis: months of follow-up)

Our goal is to estimate the survival curve that would have been observed if
nobody had been censored before $k_{\text{end}}$, where $k_{\text{end}}$ is the maximum
administrative censoring time in the study. That is, our goal is to estimate the
survival $\Pr[D_k = 0 \mid A = a]$ that would have been observed if the value of the
time-varying indicators $D_k$ were known even after censoring. Technically, we can
also refer to this quantity as $\Pr[D_k^{\bar{c}=\bar{0}} = 0 \mid A = a]$ where
$\bar{c} = (c_1, c_2, \dots, c_{k_{\text{end}}})$. As discussed in Chapter 12, the use of the
superscript $\bar{c} = \bar{0}$ makes explicit the quantity that we have in mind. We
sometimes choose to omit the superscript $\bar{c} = \bar{0}$ when no confusion can arise.
For simplicity, suppose that the time of start of follow-up was as if randomly
assigned to each individual, which would be the case if there were no secular
trends in any variable. Then the administrative censoring time, and therefore
the indicator C, is independent of both treatment and death time.

Technical Point 17.1

Approximating the hazard ratio via a logistic model. The (discrete-time) hazard ratio
$\frac{\Pr[D_{k+1} = 1 \mid D_k = 0, A = 1]}{\Pr[D_{k+1} = 1 \mid D_k = 0, A = 0]}$ is $\exp(\alpha_1)$ at all times k + 1 in the hazards model
$\Pr[D_{k+1} = 1 \mid D_k = 0, A] = \Pr[D_{k+1} = 1 \mid D_k = 0, A = 0] \times \exp(\alpha_1 A)$.
If we take logs on both sides of the equation, we obtain
$\log \Pr[D_{k+1} = 1 \mid D_k = 0, A] = \alpha_{0,k} + \alpha_1 A$ where
$\alpha_{0,k} = \log \Pr[D_{k+1} = 1 \mid D_k = 0, A = 0]$.

Suppose the hazard at k + 1 is small, i.e., $\Pr[D_{k+1} = 1 \mid D_k = 0, A] \approx 0$. Then one
minus the hazard at k + 1 is close to one, and the hazard is approximately equal to the odds:
$\Pr[D_{k+1} = 1 \mid D_k = 0, A] \approx \frac{\Pr[D_{k+1} = 1 \mid D_k = 0, A]}{\Pr[D_{k+1} = 0 \mid D_k = 0, A]}$. We then have

$$\log \frac{\Pr[D_{k+1} = 1 \mid D_k = 0, A]}{\Pr[D_{k+1} = 0 \mid D_k = 0, A]} = \operatorname{logit} \Pr[D_{k+1} = 1 \mid D_k = 0, A] \approx \alpha_{0,k} + \alpha_1 A$$

That is, if the hazard is close to zero at k + 1, we can approximate the log hazard ratio $\alpha_1$ by $\theta_1$ in a logistic model
$\operatorname{logit} \Pr[D_{k+1} = 1 \mid D_k = 0, A] = \theta_{0,k} + \theta_1 A$ like the one we used in the main text (Thompson 1977). As a rule of
thumb, the approximation is often considered to be accurate enough when $\Pr[D_{k+1} = 1 \mid D_k = 0, A] < 0.1$ for all k.

This rare event condition can almost always be guaranteed to hold: we just need to define a time unit k that is
short enough for $\Pr[D_{k+1} = 1 \mid D_k = 0, A] < 0.1$. For example, if $D_k$ stands for lung cancer, k may be measured in
years; if $D_k$ stands for infection with the common cold virus, k may be measured in days. The shorter the time unit,
the more rows in the person-time dataset used to fit the logistic model.
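
The rare-event approximation of Technical Point 17.1 can be checked numerically. The sketch below is only an illustration: the baseline hazards and the hazard ratio of 1.5 are hypothetical values, not estimates from our data, and the function name is ours. It compares the hazard ratio with the odds ratio (the quantity a logistic model targets) as the baseline hazard grows.

```python
import math

def hazard_ratio_and_odds_ratio(baseline_hazard, log_hazard_ratio):
    """Return (hazard ratio, odds ratio) at one time point under the
    multiplicative model hazard(A) = baseline_hazard * exp(alpha_1 * A)."""
    h0 = baseline_hazard
    h1 = baseline_hazard * math.exp(log_hazard_ratio)
    odds0 = h0 / (1.0 - h0)   # odds of the event among the untreated
    odds1 = h1 / (1.0 - h1)   # odds of the event among the treated
    return h1 / h0, odds1 / odds0

# Hypothetical baseline hazards; the hazard ratio is fixed at 1.5.
for h0 in (0.001, 0.01, 0.1, 0.3):
    hr, odds_ratio = hazard_ratio_and_odds_ratio(h0, math.log(1.5))
    print(f"baseline hazard {h0:5.3f}: hazard ratio {hr:.3f}, odds ratio {odds_ratio:.3f}")
```

The two ratios should nearly coincide for small hazards and drift apart as the hazard increases, which is why a shorter time unit k helps.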

We cannot validly estimate this survival $\Pr[D_k = 0 \mid A = a]$ at time k by
simply computing the fraction of individuals who received treatment level a and
survived and were not censored through k. This fraction is a valid estimator
of the joint probability $\Pr[C_{k+1} = 0, D_{k+1} = 0 \mid A = a]$, which is not what we
want. To see why, consider a study with $k_{\text{end}} = 2$ and in which the following
happens:


• $\Pr[C_1 = 0] = 1$, i.e., nobody is censored by k = 1
• $\Pr[D_1 = 0 \mid C_1 = 0] = 0.9$, i.e., 90% of individuals survive through k = 1
• $\Pr[C_2 = 0 \mid D_1 = 0, C_1 = 0] = 0.5$, i.e., a random half of survivors is censored by k = 2
• $\Pr[D_2 = 0 \mid C_2 = 0, D_1 = 0, C_1 = 0] = 0.9$, i.e., 90% of the remaining individuals survive through k = 2


The fraction of uncensored survivors at k = 2 is 1 × 0.9 × 0.5 × 0.9 = 0.405.
However, if nobody had been censored, i.e., if $\Pr[C_2 = 0 \mid D_1 = 0, C_1 = 0] = 1$,
the survival would have been 1 × 0.9 × 1 × 0.9 = 0.81. This example
illustrates why correct estimation of the survivals $\Pr[D_k = 0 \mid A = a]$ requires
the procedures described in the previous section. Specifically, under (as if)
randomly assigned censoring, the survival $\Pr[D_k = 0 \mid A = a]$ at k is

$$\prod_{m=1}^{k} \Pr[D_m = 0 \mid D_{m-1} = 0, C_m = 0, A = a] \quad \text{for } k < k_{\text{end}}$$

The estimation procedure is the same as described above except that we either
use a nonparametric estimate of, or fit a logistic model for, the cause-specific
hazard $\Pr[D_{k+1} = 1 \mid D_k = 0, C_{k+1} = 0, A = a]$.
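
The arithmetic of the two-period example can be reproduced in a few lines. This is a minimal sketch that uses only the four conditional probabilities listed above; the variable names are ours.

```python
import math

# Conditional probabilities from the example with k_end = 2.
pr_uncensored = [1.0, 0.5]  # Pr[C_1 = 0], Pr[C_2 = 0 | D_1 = 0, C_1 = 0]
pr_survive = [0.9, 0.9]     # Pr[D_1 = 0 | C_1 = 0], Pr[D_2 = 0 | C_2 = 0, D_1 = 0, C_1 = 0]

# Naive fraction: individuals who are both alive and uncensored through k = 2.
naive_fraction = math.prod(c * s for c, s in zip(pr_uncensored, pr_survive))

# Product of the conditional survival probabilities among the uncensored,
# i.e., the survival had nobody been censored (censoring as if randomly assigned).
survival_no_censoring = math.prod(pr_survive)

print(f"uncensored survivors: {naive_fraction:.3f}")        # 0.405
print(f"survival without censoring: {survival_no_censoring:.2f}")  # 0.81
```

The naive fraction mixes the censoring mechanism with survival, whereas the product of conditional probabilities removes the factor of 0.5 that is due to censoring alone.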

Often we are not ready to assume that censoring is as if randomly assigned. 
When there is staggered entry, an individual’s time of administrative censoring 
depends on the calendar time at study entry (later entries have shorter values 
of $k_{\text{end}}$) and calendar time may itself be associated with the outcome. Therefore,
the above procedure will need to be adjusted for baseline calendar time.
In addition, there may be other baseline prognostic factors that are unequally 
distributed between the treated (A = 1) and the untreated (A = 0), which also 
requires adjustment. The next sections extend the above procedure to incorpo- 
rate adjustment for baseline confounders via g-methods. In Part III we extend 
the procedure to settings with time-varying treatments and confounders. 


17.4 IP weighting of marginal structural models

[Figure 17.5: causal diagram]


When the treated and the untreated are not exchangeable, a direct contrast 
of their survival curves cannot be endowed with a causal interpretation. In 
our smoking cessation example, we estimated that the 120-month survival was 
lower in quitters than in non-quitters (76.2% versus 82.0%), but that does not 
necessarily imply that smoking cessation increases mortality. Older people are 
more likely to quit smoking and also more likely to die. This confounding by 
age makes smoking cessation look bad because the proportion of older people 
is greater among quitters than among non-quitters. 

Let us define $D_k^{a,\bar{c}=\bar{0}}$ as a counterfactual time-varying indicator for death
at k under treatment level a and no censoring. For simplicity of notation, we
will write $D_k^{a,\bar{c}=\bar{0}}$ as $D_k^{a}$ when, as in this chapter, it is clear that the goal is
estimating the survival in the absence of censoring. For additional simplicity, in
the remainder of this chapter we omit $C_k = 0$ from the conditioning event of the
hazard at k, $\Pr[D_{k+1} = 0 \mid D_k = 0, L = l, A]$. That is, we write all expressions
as if all individuals had a common administrative censoring time, like in our
smoking cessation example.

Suppose we want to compare the counterfactual survivals $\Pr[D_{k+1}^{a=1} = 0]$
and $\Pr[D_{k+1}^{a=0} = 0]$ that would have been observed if everybody had received
treatment (a = 1) and no treatment (a = 0), respectively. That is, the causal
contrast of interest is

$$\Pr[D_{k+1}^{a=1} = 0] \ \text{vs.} \ \Pr[D_{k+1}^{a=0} = 0] \quad \text{for } k = 0, 1, \dots, k_{\text{end}} - 1$$

Because of confounding, this contrast may not be validly estimated by the
contrast of the survivals $\Pr[D_{k+1} = 0 \mid A = 1]$ and $\Pr[D_{k+1} = 0 \mid A = 0]$ that we
described in the previous sections. Rather, a valid estimation of the quantities
$\Pr[D_{k+1}^{a} = 0]$ for a = 1 and a = 0 typically requires adjustment for
confounders, which can be achieved through several methods. This section
focuses on IP weighting.


Let us assume that the treated (A = 1) and the untreated (A = 0) are
exchangeable within levels of the L variables, as represented in the causal 
diagram of Figure 17.5. Like in Chapters 12 to 15, L includes the variables 
sex, age, race, education, intensity and duration of smoking, physical activity 
in daily life, recreational exercise, and weight. We also assume positivity and 
consistency. The estimation of IP weighted survival curves has two steps. 

First, we estimate the stabilized IP weight $SW^A$ for each individual in
the study population. The procedure is exactly the same as the one described
in Chapter 12. We fit a logistic model for the conditional probability
$\Pr[A = 1 \mid L]$ of treatment (i.e., smoking cessation) given the variables in
L. The denominator of the estimated $SW^A$ is $\widehat{\Pr}[A = 1 \mid L]$ for treated
individuals and $(1 - \widehat{\Pr}[A = 1 \mid L])$ for untreated individuals, where
$\widehat{\Pr}[A = 1 \mid L]$ is the predicted value from the logistic model. The numerator
of the estimated weight $SW^A$ is $\widehat{\Pr}[A = 1]$ for the treated and
$(1 - \widehat{\Pr}[A = 1])$ for the untreated, where $\widehat{\Pr}[A = 1]$ can be estimated
nonparametrically or as the predicted value from a logistic model for the
marginal probability $\Pr[A = 1]$ of treatment. See Chapter 11 for details on
predicted values.

The application of the estimated weights $SW^A$ creates a pseudo-population
in which the variables in L are independent from the treatment A, which 
eliminates confounding by those variables. In our example, the weights had 
mean 1 (as expected) and ranged from 0.33 to 4.21. 
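
The construction of the stabilized weights can be sketched in code. The example below is a simplification and not the analysis of our data: it assumes a single discrete confounder L and estimates Pr[A = 1|L] by the treated fraction within each level of L (a saturated model) in place of the parametric logistic model. The toy data and the function name are hypothetical.

```python
from collections import Counter

def stabilized_weights(treatment, confounder):
    """Stabilized IP weights SW^A for a binary treatment and one discrete L.

    Pr[A = 1 | L] is estimated nonparametrically (treated fraction per level
    of L), which is an assumption of this sketch. Positivity is required:
    every level of L must contain both treated and untreated individuals.
    """
    n = len(treatment)
    pr_treated = sum(treatment) / n                      # numerator: Pr[A = 1]
    n_by_level = Counter(confounder)
    treated_by_level = Counter(l for a, l in zip(treatment, confounder) if a == 1)

    weights = []
    for a, l in zip(treatment, confounder):
        pr_treated_given_l = treated_by_level[l] / n_by_level[l]  # denominator
        if a == 1:
            weights.append(pr_treated / pr_treated_given_l)
        else:
            weights.append((1 - pr_treated) / (1 - pr_treated_given_l))
    return weights

# Hypothetical toy data: L = 1 marks older individuals, who quit (A = 1) more often.
A = [1, 1, 1, 0, 1, 0, 0, 0]
L = [1, 1, 1, 1, 0, 0, 0, 0]
sw = stabilized_weights(A, L)
print(sum(sw) / len(sw))  # the mean of the stabilized weights is expected to be 1
```

With a saturated treatment model the weights average exactly 1; with a parametric model the mean is only approximately 1, which is why the mean is a useful informal check.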

Second, using the person-time data format, we fit a hazards model like the
one described above except that individuals are weighted by their estimated
$SW^A$. Technically, this IP weighted logistic model estimates the parameters
of the marginal structural logistic model

$$\operatorname{logit} \Pr[D_{k+1}^{a} = 0 \mid D_k^{a} = 0] = \beta_{0,k} + \beta_1 a + \beta_2 a \times k + \beta_3 a \times k^2$$

That is, the IP weighted model estimates the time-varying hazards that would
have been observed if all individuals in the study population had been treated
(a = 1) and the time-varying hazards if they had been untreated (a = 0).

The estimates of $\Pr[D_{k+1}^{a} = 0 \mid D_k^{a} = 0]$ from the IP weighted hazards models
can then be multiplied over time (see previous section) to obtain an estimate
of the survival $\Pr[D_{k+1}^{a} = 0]$ that would have been observed under treatment
a = 1 and under no treatment a = 0. The resulting curves are shown in Figure
17.6.
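
The step from model parameters to survival curves can be sketched as follows. All numbers are hypothetical: the coefficients are not our estimates, and the quadratic-free form chosen for the time-varying intercept is an assumption made only so that the example runs.

```python
import math

def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))

def survival_curve(a, k_end, intercept, betas):
    """Survival at times 1..k_end under treatment level a.

    intercept(k) plays the role of beta_{0,k}; betas = (beta_1, beta_2, beta_3).
    As in the model above, the logit is for surviving from k to k + 1.
    """
    b1, b2, b3 = betas
    curve = []
    survival = 1.0
    for k in range(k_end):
        p_survive_interval = expit(intercept(k) + b1 * a + b2 * a * k + b3 * a * k ** 2)
        survival *= p_survive_interval   # multiply conditional survivals over time
        curve.append(survival)           # curve[k] is the survival at k + 1
    return curve

def toy_intercept(k):
    # Hypothetical time-varying intercept beta_{0,k}.
    return 6.0 - 0.005 * k

toy_betas = (-0.2, 0.004, -0.00002)     # hypothetical beta_1, beta_2, beta_3
curve_treated = survival_curve(1, 120, toy_intercept, toy_betas)
curve_untreated = survival_curve(0, 120, toy_intercept, toy_betas)
print(f"120-month survival: {curve_treated[-1]:.3f} (a = 1), {curve_untreated[-1]:.3f} (a = 0)")
```

In an actual analysis the coefficients would come from the IP weighted logistic fit on the person-time data, and the whole procedure would be repeated in each bootstrap sample to obtain confidence intervals.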

In our example, the 120-month survival estimates were 80.7% under smoking
cessation and 80.5% under no smoking cessation; difference 0.2% (95% confidence
interval from −4.1% to 3.7% based on 500 bootstrap samples). Though
the survival curve under treatment was lower than the curve under no treatment
for most of the follow-up, the maximum difference never exceeded −1.4%
with a 95% confidence interval from −3.4% to 0.7%. That is, after adjustment
for the covariates L via IP weighting, we found little evidence of an effect of
smoking cessation on mortality at any time during the follow-up. The validity
of this procedure requires that neither the treatment model nor the marginal
hazards model be misspecified.


17.5 The parametric g-formula 


In the previous section we estimated the survival curve under treatment and
under no treatment in the entire study population via IP weighting. To do
so, we adjusted for L and assumed exchangeability, positivity, and consistency.
Another method to estimate the marginal survival curves under those assumptions
is standardization based on parametric models, that is, the parametric
g-formula.

In Chapter 12 we referred to models conditional on all the covariates L
as faux marginal structural models.

The survival $\Pr[D_{k+1}^{a} = 0]$ at k + 1 under treatment level a is the weighted
average of the conditional survival probabilities at k + 1 within levels of the
covariates L and treatment level A = a, with the proportion of individuals in
each level l of L as the weights. That is, under exchangeability, positivity, and
consistency, $\Pr[D_{k+1}^{a} = 0]$ equals the standardized survival

$$\sum_{l} \Pr[D_{k+1} = 0 \mid L = l, A = a] \, \Pr[L = l].$$


For a formal proof, see Section 2.3. 

Therefore, the estimation of the parametric g-formula has two steps. First, 
we need to estimate the conditional survivals $\Pr[D_{k+1} = 0 \mid L = l, A = a]$ using
our administratively censored data. Second, we need to compute their weighted 
average over all values l of the covariates L. We describe each of these two 
steps in our smoking cessation example. 

For the first step we fit a parametric hazards model like the one described
in Section 17.2, except that the variables in L are included as covariates. If
the model is correctly specified, it validly estimates the time-varying hazards
$\Pr[D_{k+1} = 1 \mid D_k = 0, L, A]$ within levels of treatment A and covariates L. The
product of one minus the conditional hazards

$$\prod_{m=0}^{k} \Pr[D_{m+1} = 0 \mid D_m = 0, L = l, A = a]$$

is equal to the conditional survival $\Pr[D_{k+1} = 0 \mid L = l, A = a]$. Because of
conditional exchangeability given L, the conditional survival for a particular
set of covariate values L = l and A = a can be causally interpreted as the
survival that would have been observed if everybody with that set of covariates
had received treatment value a. That is,

$$\Pr[D_{k+1} = 0 \mid L = l, A = a] = \Pr[D_{k+1}^{a} = 0 \mid L = l]$$


Therefore the conditional hazards can be used to estimate the survival curve 
under treatment (a = 1) and no treatment (a = 0) within each combination 
of values l of L. For example, we can use this model to estimate the survival
curves under treatment and no treatment for white men aged 61, with college 
education, low levels of exercise, etc. However, our goal is estimating the 
marginal, not the conditional, survival curves under treatment and under no 
treatment. 

For the second step we compute the weighted average of the conditional
survival across all values l of the covariates L, that is, we standardize the
survival to the confounder distribution. To do so, we use the method described
in Section 13.3 to standardize means: standardization by averaging after
expansion of dataset, outcome modeling, and prediction. This method can be
used even when some of the variables in L are continuous so that the sum over
values l is formally an integral. The resulting curves are shown in Figure 17.7.
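
The two steps of the parametric g-formula can be sketched as follows. This is an illustration under stated assumptions: `toy_hazard` is a hypothetical stand-in for the predicted values of a fitted conditional hazards model, and the covariate values are invented. Averaging over the individuals in the dataset is equivalent to weighting each level l by its proportion Pr[L = l].

```python
import math

def conditional_survival(hazards):
    """Survival at k + 1 given L = l and A = a: product of one minus each hazard."""
    return math.prod(1.0 - h for h in hazards)

def standardized_survival(a, covariate_values, hazard, k):
    """Average the conditional survival at k + 1 over the observed values of L.

    hazard(m, l, a) returns Pr[D_{m+1} = 1 | D_m = 0, L = l, A = a]; every
    individual is assigned treatment level a regardless of the treatment received.
    """
    total = 0.0
    for l in covariate_values:
        total += conditional_survival(hazard(m, l, a) for m in range(k + 1))
    return total / len(covariate_values)

def toy_hazard(m, l, a):
    # Hypothetical conditional hazard, constant over time m.
    return 0.01 + 0.02 * l + 0.005 * a

observed_L = [0, 0, 0, 1]   # hypothetical covariate values, one per individual
surv_treated = standardized_survival(1, observed_L, toy_hazard, k=1)
surv_untreated = standardized_survival(0, observed_L, toy_hazard, k=1)
print(f"survival at k + 1 = 2: {surv_treated:.6f} (a = 1), {surv_untreated:.6f} (a = 0)")
```

Because the average is taken over individuals, the same code works when L is continuous or multivariate, provided `hazard` accepts each individual's covariate values.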

In our example, the survival curve under treatment was lower than the curve
under no treatment during the entire follow-up, but the maximum difference
never exceeded −2.0% (95% confidence interval from −5.6% to 1.8%). The
120-month survival estimates were 80.4% under smoking cessation and 80.6% under
no smoking cessation; difference −0.2% (95% confidence interval from −4.6% to
4.1%). That is, after adjustment for the covariates L via standardization, we
found little evidence of an effect of smoking cessation on mortality at any time
during the follow-up. Note that the survival curves estimated via IP weighting
(previous section) and the parametric g-formula (this section) are similar but
not identical because they rely on different parametric assumptions: the IP
weighted estimates require no misspecification of a model for treatment and
a model for the unconditional hazards; the parametric g-formula estimates
require no misspecification of a model for the conditional hazards.


17.6 G-estimation of structural nested models 


In fact, we may not even approximate a hazard ratio because structural nested
logistic models do not generalize easily to time-varying treatments (Technical
Point 14.1).

Tchetgen Tchetgen et al (2015) and Robins (1997b) described survival analysis
with instrumental variables that exhibit similar problems to those described
here for structural nested models.

The ‘nested’ component is only evident when treatment is time-varying. See
Chapter 21.

The negative sign in front of $\psi$ preserves the usual interpretation of positive
parameters indicating harm and negative parameters indicating benefit.


The previous sections describe causal contrasts that compare survivals, or risks, 
under different levels of treatment A. The survival was computed from haz- 
ards estimated by logistic regression models. This approach is feasible when 
the analytic method is IP weighting of marginal structural models or the para- 
metric g-formula, but not when the method is g-estimation of structural nested 
models. As explained in Chapter 14, structural nested models are models for 
conditional causal contrasts (e.g., the difference or ratio of covariate-specific 
means under different treatment levels), not for the components of those con- 
trasts (e.g., each of the means under different treatment levels). Therefore we 
cannot estimate survivals or hazards using a structural nested model. 

We can, however, consider a structural nested log-linear model to model 
the ratio of cumulative incidences (i.e., risks) under different treatment levels.
Structural nested cumulative failure time models do precisely that (see Tech- 
nical Point 17.2), but they are best used when failure is a rare event because 
log-linear models do not naturally impose an upper limit of 1 on the risk. For 
non-rare failures, we can instead consider a structural nested log-linear model 
to model the ratio of cumulative survival probabilities (i.e., 1 − risk) under different
treatment levels. Structural nested cumulative survival time models do
precisely that (see Technical Point 17.2), but they are best used when survival 
is rare because log-linear models do not naturally impose an upper limit of 1 
on the survival. A more general option is to use a structural nested model that 
models the ratio of survival times under different treatment options. That is, 
an accelerated failure time (AFT) model. 

Let $T_i^{a}$ be the counterfactual time of survival for individual i under treatment
level a. The effect of treatment A on individual i's survival time can be
measured by the ratio $T_i^{a=1}/T_i^{a=0}$ of her counterfactual survival times under
treatment and under no treatment. If the survival time ratio is greater than 1,
then treatment is beneficial because it increases the survival time; if the ratio
is less than 1, then treatment is harmful; if the ratio is 1, then treatment has
no effect. Suppose, temporarily, that the effect of treatment is the same for
every individual in the population.

We could then consider the structural nested accelerated failure time (AFT)
model $T_i^{a}/T_i^{a=0} = \exp(-\psi_1 a)$, where $\psi_1$ measures the expansion (or
contraction) of each individual's survival time attributable to treatment. If $\psi_1 < 0$
then treatment increases survival time, if $\psi_1 > 0$ then treatment decreases
survival time, if $\psi_1 = 0$ then treatment does not affect survival time. More
generally, the effect of treatment may depend on covariates L so a more general
structural AFT would be $T_i^{a}/T_i^{a=0} = \exp(-\psi_1 a - \psi_2 a L_i)$, with $\psi_1$ and $\psi_2$ (a
vector) constant across individuals. Rearranging the terms, the model can be
written as

$$T_i^{a=0} = T_i^{a} \exp(\psi_1 a + \psi_2 a L_i) \quad \text{for all individuals } i$$

Technical Point 17.2

Structural nested cumulative failure time (CFT) models and cumulative survival time (CST) models. For a
time-fixed treatment, a (non-nested) structural CFT model is a model for the ratio of the counterfactual risk under
treatment value a divided by the counterfactual risk under treatment value 0 conditional on treatment A and covariates
L. The general form of the model is

$$\frac{\Pr[D_k^{a} = 1 \mid L, A]}{\Pr[D_k^{a=0} = 1 \mid L, A]} = \exp[\gamma_k(L, A; \psi)]$$

where $\gamma_k(L, A; \psi)$ is a function of treatment and covariate history indexed by the (possibly vector-valued) parameter $\psi$.
For consistency, $\exp[\gamma_k(L, A; \psi)]$ must be 1 when A = 0 because then the two treatment values being compared are
identical, and when there is no effect of treatment at time m on outcome at time k. An example of such a function is
$\gamma_k(L, A; \psi) = \psi A$ so $\psi = 0$ corresponds to no effect, $\psi < 0$ to beneficial effect, and $\psi > 0$ to harmful effect.

Analogously, for a time-fixed treatment, a (non-nested) structural CST model is a model for the ratio of the
counterfactual survival under treatment value a divided by the counterfactual survival under treatment level 0 conditional
on treatment A and covariates L. The general form of the model is

$$\frac{\Pr[D_k^{a} = 0 \mid L, A]}{\Pr[D_k^{a=0} = 0 \mid L, A]} = \exp[\gamma_k(L, A; \psi)]$$

Although CFT and CST models differ only in whether we specify a multiplicative model for $\Pr[D_k^{a} = 1 \mid L, A]$ versus
for $\Pr[D_k^{a} = 0 \mid L, A]$, the meaning of $\gamma_k(L, A; \psi)$ differs because a multiplicative model for risk is not a multiplicative
model for survival, whenever the treatment effect is non-null. When we let the time index k be continuous rather than
discrete, a structural CST model is equivalent to a structural additive hazards model (Tchetgen Tchetgen et al., 2015)
as any model for $\Pr[D_k^{a} = 0 \mid L, A] / \Pr[D_k^{a=0} = 0 \mid L, A]$ induces a model for the difference in the time-specific hazards
of $T^{a}$ and $T^{a=0}$, and vice-versa.

The use of structural CFT models requires that, for all values of the covariates L, the conditional cumulative
probability of failure under all treatment values satisfies a particular type of rare failure assumption. In this “rare failure”
context, the structural CFT model has an advantage over AFT models: it admits unbiased estimating equations that
are differentiable in the model parameters and thus are easily solved. Page (2005) and Picciotto et al. (2012) provided
further details on structural CFT and CST models. For a time-varying treatment, this class of models can be viewed as
a special case of the multivariate structural nested mean model (Robins 1994). See Technical Point 14.1.


Following the same reasoning as in Chapter 14, consistency of counterfactuals
implies the model $T_i^{a=0} = T_i \exp(\psi_1 A_i + \psi_2 A_i L_i)$, in which the
counterfactual time $T_i^{a}$ is replaced by the actual survival time $T_i^{A} = T_i$. The parameters
$\psi_1$ and $\psi_2$ can be estimated by a modified g-estimation procedure (to account
for administrative censoring) that we describe later in this section.

The above structural AFT is unrealistic because it is both deterministic
and rank-preserving. It is deterministic because it assumes that, for each
individual, the counterfactual survival time under no treatment $T_i^{a=0}$ can be
computed without error as a function of the observed survival time T, treatment
A, and covariates L. It is rank-preserving because, under this model, if
individual i would die before individual j had they both been untreated, i.e.,
$T_i^{a=0} < T_j^{a=0}$, then individual i would also die before individual j had they
both been treated, i.e., $T_i^{a=1} < T_j^{a=1}$.

Because of the implausibility of rank preservation, one should not generally
use methods for causal inference that rely on it, as we discussed in Chapter 14.
And yet again we will use a rank-preserving model here to describe g-estimation
for structural AFT models because g-estimation is easier to understand for
rank-preserving models, and because the g-estimation procedure is actually
the same for rank-preserving and non-rank-preserving models.

Robins (1997b) described non-deterministic non-rank-preserving structural
nested AFT models.

For simplicity, consider the simpler rank-preserving model $T_i^{a=0} = T_i \exp(\psi A_i)$
without a product term between treatment and covariates. G-estimation of the
parameter $\psi$ of this structural AFT model would be straightforward if
administrative censoring did not exist, that is, if we could observe the time of death
T for all individuals. In fact, in that case the g-estimation procedure would be
the same as we described in Section 14.5. The first step would be to compute
candidate counterfactuals $H_i(\psi^\dagger) = T_i \exp(\psi^\dagger A_i)$ under many possible values
$\psi^\dagger$ of the causal parameter $\psi$. The second step would be to find the value $\psi^\dagger$
that results in a $H_i(\psi^\dagger)$ that is independent of treatment A in a logistic model
for the probability of A = 1 with $H_i(\psi^\dagger)$ and the confounders L as covariates.
Such value $\psi^\dagger$ would be the g-estimate of $\psi$.
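
The two steps can be sketched with a grid search when there is no administrative censoring. The sketch makes simplifying assumptions that are not part of the procedure described above: treatment is randomized with no covariates L, so the logistic model is replaced by the simpler requirement that the sum of $(A_i - \bar{A}) H_i(\psi^\dagger)$ be as close to zero as possible, where $\bar{A}$ is the proportion treated. The toy survival times are hypothetical and are generated from a rank-preserving model with a known $\psi$.

```python
import math

def candidate_counterfactual(T, A, psi):
    """H(psi) = T * exp(psi * A)."""
    return T * math.exp(psi * A)

def estimating_function(times, treatments, psi):
    """Sum of (A - mean(A)) * H(psi); it is zero when H(psi) and A are uncorrelated."""
    mean_a = sum(treatments) / len(treatments)
    return sum((a - mean_a) * candidate_counterfactual(t, a, psi)
               for t, a in zip(times, treatments))

def g_estimate(times, treatments, grid):
    """Grid value of psi whose candidate counterfactuals are least associated with A."""
    return min(grid, key=lambda psi: abs(estimating_function(times, treatments, psi)))

# Hypothetical data: both arms share the same untreated survival times, and
# treatment contracts survival time by a factor of 1.5 (true psi = log 1.5).
true_psi = math.log(1.5)
untreated_times = [36.0, 72.0, 108.0]
times = [t * math.exp(-true_psi) for t in untreated_times] + untreated_times
treatments = [1, 1, 1, 0, 0, 0]

grid = [i / 1000 for i in range(-1000, 1001)]   # candidate values from -1 to 1
psi_hat = g_estimate(times, treatments, grid)
print(f"g-estimate: {psi_hat:.3f}, true value: {true_psi:.3f}")
```

With confounders L, the proportion treated would be replaced by the predicted probability of treatment given L, or equivalently one would search for the value at which the coefficient of $H(\psi^\dagger)$ in the logistic model is zero.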

However, this procedure cannot be implemented in the presence of administrative
censoring at time K because $H_i(\psi^\dagger)$ cannot be computed for individuals
with unknown $T_i$. One might then be tempted to restrict the g-estimation
procedure to individuals with an observed survival time only, i.e., those with
$T_i < K$. Unfortunately, that approach results in selection bias. To see why,
consider the following oversimplified scenario. 

We conduct a 60-month randomized experiment to estimate the effect of
a dichotomous treatment A on survival time T. Only 3 types of individuals
participate in our study. Type 1 individuals are those who, in the absence of
treatment, would die at 36 months ($T^{a=0} = 36$). Type 2 individuals are those
who, in the absence of treatment, would die at 72 months ($T^{a=0} = 72$). Type 3
individuals are those who, in the absence of treatment, would die at 108 months
($T^{a=0} = 108$). That is, type 3 individuals have the best prognosis and type
1 individuals have the worst one. Because of randomization, we expect that
the proportions of type 1, type 2, and type 3 individuals are the same in each
of the two treatment groups A = 1 and A = 0. That is, the treated and the
untreated are expected to be exchangeable.

Suppose that treatment A = 1 decreases the survival time compared with 
A=0. Table 17.1 shows the survival time under treatment and under no treat- 
ment for each type of individual. Because the administrative end of follow-up is 
K = 60 months, the death of type 1 individuals will be observed whether they 
are randomly assigned to A = 1 or A = 0 (both survival times are less than 60), 
and the death of type 3 individuals will be administratively censored whether 
they are randomly assigned to A = 1 or A = 0 (both survival times are greater 
than 60). The death of type 2 individuals, however, will only be observed if 
they are assigned to A = 1. Hence an analysis that welcomes all individuals 
with non-administratively censored death times will have an imbalance of in- 
dividual types between the treated and the untreated. Exchangeability will be 
broken because the A = 1 group will include type 1 and type 2 individuals, 
whereas the A = 0 group will include type 1 individuals only. Individuals in the 
A = 0 group will have, on average, a worse prognosis than those in the A = 1 
group, which will make treatment look better than it really is. This selection 
bias (Chapter 8) arises when treatment has a non-null effect on survival time. 
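
The imbalance can be verified directly from the survival times in Table 17.1. The sketch below is a bookkeeping exercise with names of our choosing; it lists which types of individuals have an observed death time in each arm when follow-up ends at K = 60 months.

```python
K = 60  # administrative end of follow-up, in months

# Table 17.1: survival time in months by individual type and treatment level a.
SURVIVAL_TIME = {
    1: {0: 36, 1: 24},
    2: {0: 72, 1: 48},
    3: {0: 108, 1: 72},
}

def types_with_observed_death(a):
    """Types whose death would be observed before K if assigned to treatment level a."""
    return [t for t, times in SURVIVAL_TIME.items() if times[a] < K]

def mean_untreated_time(types):
    """Average survival time under no treatment, a summary of prognosis."""
    return sum(SURVIVAL_TIME[t][0] for t in types) / len(types)

observed_if_treated = types_with_observed_death(1)     # types analyzed in the A = 1 arm
observed_if_untreated = types_with_observed_death(0)   # types analyzed in the A = 0 arm
observed_either_way = [t for t in observed_if_treated if t in observed_if_untreated]

print(observed_if_treated, mean_untreated_time(observed_if_treated))      # [1, 2] 54.0
print(observed_if_untreated, mean_untreated_time(observed_if_untreated))  # [1] 36.0
print(observed_either_way)                                                # [1]
```

Restricting to individuals with observed deaths leaves the treated arm with a better average prognosis (54 versus 36 months under no treatment), which is the loss of exchangeability described above.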

Table 17.1

Type          1     2     3
$T^{a=0}$    36    72   108
$T^{a=1}$    24    48    72

To avoid this selection bias, one needs to select individuals whose survival
time would have been observed by the end of follow-up whether they had
been treated or untreated, i.e., those with $T_i^{a=0} < K$ and $T_i^{a=1} < K$. In our
example, we will have to exclude all type 2 individuals from the analysis in order
to preserve exchangeability. That is, we will not only exclude administratively
censored individuals with $T_i > K$, but also some uncensored individuals with
known survival time $T_i < K$ because their survival time would have been
greater than K if they had received a treatment level different from the one
they actually received.

This exclusion of uncensored individuals from the analysis is often referred to
as artificial censoring.

We then define an indicator $\Delta(\psi)$, which takes value 0 when an individual
is excluded and 1 when she is not. The g-estimation procedure is then modified
by replacing the variable $H(\psi^\dagger)$ by the indicator $\Delta(\psi^\dagger)$. See Technical Point
17.3 for details. In our example, the g-estimate $\hat{\psi}$ from the rank-preserving
structural AFT model $T_i^{a=0} = T_i \exp(\psi A_i)$ was −0.047 (95% confidence
interval: −0.223 to 0.333). The number $\exp(-\hat{\psi}) = 1.05$ can be interpreted as the
median survival time that would have been observed if all individuals in the
study had received a = 1 divided by the median survival time that would have
been observed if all individuals in the study had received a = 0. This survival
time ratio suggests little effect of smoking cessation A on the time to death.

Code: Program 17.5

The point estimate of $\psi$ is the value that corresponds to the minimum of
the estimating function described in Technical Point 17.3; the limits of
the 95% confidence interval are the values that correspond to the value
3.84 ($\chi^2$ with one degree of freedom) of the estimating function.

Technical Point 17.3

Artificial censoring. Let $K(\psi)$ be the minimum survival time under no treatment that could possibly correspond
to an individual who actually died at time K (the administrative end of follow-up). For a dichotomous treatment
A, $K(\psi) = \inf\{K \exp(\psi A)\}$, which implies that $K(\psi) = K \exp(\psi \times 0) = K$ if treatment contracts the survival
time (i.e., $\psi > 0$), $K(\psi) = K \exp(\psi \times 1) = K \exp(\psi)$ if treatment expands the survival time (i.e., $\psi < 0$), and
$K(\psi) = K \exp(0) = K$ if treatment does not affect survival time (i.e., $\psi = 0$).

All individuals who are administratively censored (i.e., T > K) have $\Delta(\psi) = 0$ because there is at least one
treatment level (the one they actually received) under which their survival time is greater than K, i.e., $H(\psi) > K(\psi)$.
Some of the individuals who are not administratively censored (i.e., T < K) also have $\Delta(\psi) = 0$ and are excluded from
the analysis (they are artificially censored) to avoid selection bias.

The artificial censoring indicator $\Delta(\psi)$ is a function of $H(\psi)$ and K. Under conditional exchangeability given L, all
such functions, when evaluated at the true value of $\psi$, are conditionally independent of treatment A given the covariates
L. That is, g-estimation of the AFT model parameters can be performed based on $\Delta(\psi)$ rather than $H(\psi)$. Technically,
$\Delta(\psi)$ is substituted for $H(\psi)$ in the estimating equation of Technical Point 14.2. For practical estimation details, see
the Appendix of Hernán et al (2005).
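
The artificial censoring indicator can be sketched for the rank-preserving model with a dichotomous treatment. The function names and the treatment of ties (strict inequalities at the boundary K) are assumptions of this sketch; the example values come from Table 17.1, where treatment contracts survival time by a factor of 1.5.

```python
import math

def k_psi(K, psi):
    """K(psi): the smallest value of K * exp(psi * a) over a in {0, 1}."""
    return min(K * math.exp(psi * a) for a in (0, 1))

def h_psi(T, A, psi):
    """Candidate counterfactual survival time under no treatment."""
    return T * math.exp(psi * A)

def delta(T, A, K, psi):
    """1 if the individual stays in the analysis, 0 if censored.

    Individuals with T >= K are administratively censored (assumed boundary
    convention); the rest are artificially censored when H(psi) >= K(psi).
    """
    if T >= K:
        return 0
    return 1 if h_psi(T, A, psi) < k_psi(K, psi) else 0

K = 60
psi = math.log(1.5)   # value implied by Table 17.1 (36 / 24 = 1.5)

# (type, A, survival time under the treatment level actually received)
individuals = [(1, 1, 24), (2, 1, 48), (3, 1, 72), (1, 0, 36), (2, 0, 72), (3, 0, 108)]
kept = [(t, a) for t, a, T in individuals if delta(T, a, K, psi) == 1]
print(kept)   # only type 1 individuals remain, in both arms

# Survival time ratio implied by the g-estimate reported in the text.
print(round(math.exp(0.047), 2))   # 1.05
```

Because $K(\psi)$ depends on the candidate value of $\psi$, the set of artificially censored individuals changes across the grid of candidate values, which contributes to the non-smooth estimating function mentioned below.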

As we said in Chapter 14, structural nested models, including AFT models, 
have rarely been used in practice. A practical obstacle for the implementation 
of the method is the lack of user-friendly software. An even more serious 
obstacle in the survival analysis setting is that the parameters of structural 
AFT models need to be estimated through search algorithms that are not 
guaranteed to find a unique solution. This problem is exacerbated for models 
with two or more parameters $\psi$. As a result, the few published applications
of this method tend to use simplistic AFT models that do not allow for the 
treatment effect to vary across covariate values. 

This state of affairs is unfortunate because subject-matter knowledge (e.g., 
biological mechanisms) is easier to translate into parameters of structural
AFT models than into those of structural hazards models. This is especially 
true when using non-deterministic and non-rank preserving structural AFT 
models. 


Chapter 18 
VARIABLE SELECTION FOR CAUSAL INFERENCE 


In the previous chapters, we have described several adjustment methods to estimate the causal effect of a treatment
A on an outcome Y, including stratification and outcome regression, standardization and the parametric g-formula,
IP weighting, and g-estimation. Each of these methods carries out the adjustment in a different way, but all of them
rely on the same condition: the set of adjustment variables L must include sufficient information to
achieve conditional exchangeability between the treated A = 1 and the untreated A = 0 or, equivalently, to block
all backdoor paths between A and Y without opening other biasing paths.

In practice, a common question is how to select the variables L for adjustment. This chapter offers some
guidelines for variable selection when the goal of the data analysis is causal inference. Because the variable
selection criteria for causal inference are not the same as for prediction, widespread procedures for variable selection
in predictive analyses may not be directly transferable to causal analyses. This chapter summarizes the problems
of incorrect variable selection in causal analyses and outlines some practical guidance.


18.1 The different goals of variable selection 


Even if the outcome model includes 
all confounders for the effect of 
A on Y, the association between 
each confounder and the outcome 
cannot be causally interpreted 
because we do not adjust for the 
confounders of the confounders. 


Reminder: Confounding is a causal 
concept that does not apply when 
the estimand is an association 
rather than a causal effect. 


As we have discussed throughout this book, valid causal inferences usually 
require adjustment for confounding and other biases. When an association 
measure between a treatment A and an outcome Y may be partly or fully 
explained by confounders L, adjustment for these confounders needs to be 
incorporated into the data analysis. Otherwise, the association measure cannot 
be interpreted as a causal effect measure. 

But if the goal of the data analysis is purely predictive, no adjustment for 
confounding is necessary. If we just want to quantify the association between 
smoking cessation A and weight gain Y, we simply estimate that association 
from the data by comparing the average weight gain between those who did and 
did not quit smoking. More generally, if we want to develop a predictive model 
for weight gain, we will want to include covariates (like smoking cessation, 
baseline weight, and annual income) that predict weight gain. We do not 
ask the question of whether those covariates are confounders because there is 
no treatment variable whose effect can be confounded. In predictive models, 
we do not try to endow any parameter estimates with a causal interpretation 
and therefore we do not try to adjust for confounding because the concept of 
confounding does not even apply. 

The distinction between predictive/associational models and causal mod- 
els was discussed in Section 15.5. For example, clinical investigators can use 
outcome regression to identify patients at high risk of developing heart failure. 
The goal is classification, which is a form of prediction. The parameters of 
these predictive models do not necessarily have any causal interpretation and 
all covariates in the model have the same status, i.e., there is no treatment 
variable A and there are no adjustment variables L. For example, a prior 
hospitalization may be identified as a useful predictor of future heart failure, 
but nobody would suggest that we stop hospitalizing people in order to prevent 
heart failures. Identifying 
patients with bad prognosis (prediction) is very different from identifying the 
Fine Point 18.1 


Variable selection procedures for regression models. Suppose we want to fit a regression model with predictive 
purposes, but the database includes so many potential predictors—perhaps even more than individuals—that including 
all of them in the model is either impossible or results in very unstable predictions. Several approaches exist to deal with 
this problem in regression models. A detailed description of these procedures can be found in many books. See, for 
example, the books by Hastie, Tibshirani, and Friedman (2009), and by Harrell (2015). Below we briefly outline some 
of the existing approaches. 

One approach is to select a subset of the available variables. A conceptually simple way to find the best subset 
would be to first decide the number of variables in the model, then fit all possible combinations of models with that 
number of variables, and finally choose the best one according to some pre-specified criterion (e.g., Akaike’s Information 
criterion). However, this approach becomes computationally infeasible for a large number of available variables. More 
efficient methods to select variables are forward selection (start with no variables and, in each step of the algorithm, 
add the variable that leads to the greatest improvement), backward elimination (start with all variables and, in each 
step, delete the variable that leads to the greatest improvement), and stepwise selection (a combination of forward 
selection and backward elimination). The variable selection algorithm ends when no further improvement is possible, 
with improvement again defined according to some pre-specified criterion. These algorithms are easy to implement but, 
on the other hand, they do not explore all possible subsets of variables. 

An alternative to subset selection is shrinkage. The idea is to modify the estimation method by adding a “penalty” 
that forces the model parameter estimates (other than the intercept) to be closer to zero than they would have been in 
the absence of the penalty. That is, the parameter estimates are shrunk towards zero. As a result of this shrinkage, the 
variance decreases and the prediction becomes more stable. The two best known shrinkage methods are ridge regression 
and the lasso or “least absolute shrinkage and selection operator,” which was proposed by Santosa and Symes (1986) 
and rediscovered by Tibshirani (1996). Unlike ridge regression, the lasso allows some parameter values to be exactly 
zero. Therefore, the lasso is both a shrinkage method and a subset selection method. The lasso has become the preferred 
variable selection method for regression models, as it generally outperforms stepwise selection in terms of prediction 
accuracy. 
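
The contrast between the two penalties can be sketched in a special case. The snippet below assumes an orthonormal design (standardized, mutually uncorrelated predictors) and the penalized criteria ½Σ(y − Xβ)² + λΣ|β_j| for the lasso and ½Σ(y − Xβ)² + (λ/2)Σβ_j² for ridge regression. Under those assumptions each penalized estimate is a simple function of the least-squares estimate z. The function names, the four z values, and the penalty of 0.5 are hypothetical choices made only for illustration.

```python
def lasso_estimate(z, lam):
    """Lasso solution for one coefficient under an orthonormal design:
    soft-thresholding moves z toward zero by lam and stops at exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_estimate(z, lam):
    """Ridge solution under the same design: proportional shrinkage,
    which is never exactly zero unless z itself is zero."""
    return z / (1.0 + lam)


# Hypothetical least-squares estimates for four predictors and a penalty of 0.5.
least_squares = [2.0, 0.3, -0.8, -0.05]
penalty = 0.5

lasso = [lasso_estimate(z, penalty) for z in least_squares]
ridge = [ridge_estimate(z, penalty) for z in least_squares]

print("lasso:", lasso)  # the two small coefficients are set exactly to zero
print("ridge:", ridge)  # every coefficient shrinks but stays non-zero
```

With correlated predictors no such closed form exists and the lasso is fit iteratively, but the qualitative difference (exact zeros versus proportional shrinkage) is the same.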





best course of action to prevent or treat a disease (causal inference). 


For pure prediction, investigators may want to select any variables that 
improve predictive ability. The selection of these variables can be automated 
to obtain the best possible prediction using data from the population of inter- 
est. Some automatic variable selection algorithms, like the lasso, are designed 
for predictive regression models (see Fine Point 18.1) whereas others are im- 
plemented as part of non-regression algorithms (e.g., neural networks). All 
of them can use cross-validation (see Fine Point 18.2) to optimize predictive 
accuracy. Because some selection algorithms are “black-box” procedures, it is 
not always easy to explain how the variables are selected or why the algorithm 
works. However, that does not necessarily matter. A reasonable point of view 
is that, for purely predictive purposes in a particular population and setting, 
whatever works to improve prediction is fair game, regardless of interpretabil- 
ity. 

In contrast, in a causal analysis, a thoughtful selection of confounders is 
needed before endowing the treatment parameter estimates with a causal inter- 
pretation. Automatic variable selection procedures may work for prediction, 
but not generally for causal inference. Variable selection algorithms may select 
variables that introduce bias in the effect estimate. There are several reasons 
why this bias may arise. Some of these reasons have been described earlier in 
the book; others have not been described yet. The next section summarizes 
all of them. 
Fine Point 18.2 


Overfitting and cross-validation. Overfitting is a common problem of all variable selection methods for regression 
models: The variables are selected to predict the data points as well as possible, without taking into consideration that 
some of the variation observed in the data is purely random. As a result, the model predicts very well for the individuals 
used to estimate the model parameters, but the model predicts poorly for future individuals who were not used to estimate 
the model parameters. The same problem arises in predictive algorithms such as random forests, neural networks, and 
other machine learning algorithms. 

A straightforward solution to the overfitting problem is to split the sample in two parts: a training sample used to 
run the predictive algorithm (that is, to estimate the model parameters when using regression) and a validation sample 
used to evaluate the accuracy of the algorithm's predictions. For a sample size n, we use v individuals for the validation 
set and n − v individuals for the training set. When using the lasso, the degree of shrinkage in the training sample may 
be guided by the model's performance in the validation sample. 

The obvious downside of splitting the sample into training and validation subsamples is that the predictive algorithm 
only uses—e.g., the model parameters are estimated in—a subset of individuals, which increases the variance. A solution 
is to repeat the splitting process multiple times, which increases the effective number of individuals used by the predictive 
algorithm. Then one can evaluate the algorithm's predictive accuracy as the average over all the validation samples. 
This procedure is known as cross-validation or out-of-sample testing. Different forms of cross-validation exist. 

A procedure referred to as “leave-v-out cross-validation” analyzes all possible partitions of the sample into training 
sample and validation sample of size v. However, examining all such partitions may become computationally infeasible 
for moderately large values of n and v. Therefore, in practice, it is common to choose v = 1, a procedure referred 
to as “leave-one-out cross-validation.” When even leave-one-out cross-validation is too computationally intensive, one 
may consider a cross-validation procedure that does not exhaust all possible partitions. For example, in “k-fold 
cross-validation,” the sample is split into k subsamples of equal size. Then one of the subsamples is used as the validation 
sample and the other k − 1 subsamples as the training sample. The procedure is repeated k times, each time holding out a different subsample, with k = 10 being 
a common choice. See the book by Hastie, Tibshirani, and Friedman (2009) for a description of cross-validation and 
related techniques. 
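
The k-fold procedure can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the "predictive algorithm" is deliberately trivial (it predicts the training-sample mean for everyone) so that the splitting logic stays in view.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k subsamples of near-equal size
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation


def cross_validated_mse(y, k=10, seed=0):
    """Average out-of-sample squared error of a toy predictor (the training mean)."""
    fold_errors = []
    for training, validation in k_fold_splits(len(y), k, seed):
        prediction = sum(y[j] for j in training) / len(training)
        mse = sum((y[j] - prediction) ** 2 for j in validation) / len(validation)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Each individual is used for validation exactly once and for training k - 1 times.
outcomes = [float(i % 7) for i in range(50)]
print(cross_validated_mse(outcomes, k=10))
```

In a real analysis the training mean would be replaced by the model or algorithm being tuned, for example a lasso fit at a given penalty, and the penalty with the smallest cross-validated error would be retained.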





18.2 Variables that induce or amplify bias 


Imagine that we have unlimited computational power and a dataset with a 
quasi-infinite number of individuals (the rows of the dataset) and many 
variables measured for each individual (the columns of the dataset), including 
treatment A, outcome Y, and a large number of variables X, some of which 
may be confounders of the effect of A on Y. In this setting, we can afford to 
adjust for as many variables in the dataset as we wish, without computational, 
numerical, or statistical constraints. 


Figure 18.1 (causal diagram with nodes A, L, Y, and U) 


Collapsibility reminder: When adjusting for covariates using stratification, 
remember that the adjusted association measure may differ from the unadjusted 
association measure, even when no confounding exists. See Fine Point 4.3. 


Say that we want to unbiasedly estimate the average causal effect of a binary 
treatment A on the outcome Y, that is, $E[Y^{a=1}] - E[Y^{a=0}]$. Then the 
goal of covariate adjustment is to eliminate as much confounding as possible 
by using the information contained in the measured variables X. We could 
easily adjust for all measured variables X via stratification/outcome regression, 
standardization/g-formula, IP weighting, or g-estimation. Are there any 
reasons to adjust for only a subset of X rather than simply adjust for all 
available variables X? The answer is yes. Even in this ideal setting, we want to 
ensure that some variables are not selected for adjustment because adjustment 
for those variables would induce bias. The next examples illustrate this point 
when some of the variables L in X are causally affected by A. 


Suppose the causal structure of the problem is represented by the causal 
diagram of Figure 18.1 (same as Figure 7.7) in which the variable L is a collider. 
In Figure 18.4, adjusting for L 
blocks the path A → L → Y but 
not the path A → Y. Thus the 
A-Y association adjusted for L is a 
biased estimator of the total effect 
of A on Y but an unbiased 
estimator of the direct effect of A on 
Y that is not mediated through L 
(Schisterman et al. 2009). 


Figure 18.4 
Here the average causal effect $E[Y^{a=1}] - E[Y^{a=0}] = 0$ is unbiasedly estimated 
by $E[Y|A = 1] - E[Y|A = 0]$ since there is no confounding by L. Suppose now 
we try to estimate the average causal effect by adjusting for L, e.g., via the 
g-formula $\sum_l E[Y|A = 1, L = l] \Pr(L = l) - \sum_l E[Y|A = 0, L = l] \Pr(L = l)$. 
This contrast differs from $E[Y|A = 1] - E[Y|A = 0]$, and thus is biased, 
because L is both conditionally associated with Y given A and marginally 
associated with A, so $\Pr(L = l) \neq \Pr(L = l | A)$. Because the A-Y association 
adjusted for L is expected to be non-null even though the causal effect of 
treatment A on the outcome Y is null, we say that there is selection bias under 
the null. The same bias is expected to arise when we adjust for a variable L 
that, as in the causal diagram of Figure 18.2, is a descendant of the collider 
B. You may want to review Chapter 8 for more examples of causal structures 
with colliders and their descendants. 
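
Selection bias under the null can be checked with exact arithmetic rather than simulated data. The sketch below assumes one collider structure, A → L ← U → Y with U unmeasured and no arrow from A to Y. That structure is an assumption and may not match Figure 18.1 in every detail. All probabilities and function names are hypothetical, chosen only to make the calculation concrete.

```python
from itertools import product

NAMES = ("a", "u", "l", "y")


def joint_distribution():
    """Exact Pr(A=a, U=u, L=l, Y=y) under the assumed graph A -> L <- U -> Y."""
    joint = {}
    for a, u, l, y in product((0, 1), repeat=4):
        p_l1 = 0.1 + 0.4 * a + 0.4 * u  # L is a common effect of A and U
        p_y1 = 0.2 + 0.6 * u            # Y depends on U only: null effect of A
        p_l = p_l1 if l == 1 else 1 - p_l1
        p_y = p_y1 if y == 1 else 1 - p_y1
        joint[(a, u, l, y)] = 0.5 * 0.5 * p_l * p_y
    return joint


def prob(joint, **fixed):
    """Probability of the event that fixes some of the variables a, u, l, y."""
    return sum(p for key, p in joint.items()
               if all(key[NAMES.index(name)] == value for name, value in fixed.items()))


def mean_y(joint, **given):
    """E[Y | given] for binary Y."""
    return prob(joint, y=1, **given) / prob(joint, **given)


def unadjusted_contrast(joint):
    return mean_y(joint, a=1) - mean_y(joint, a=0)


def standardized_contrast(joint):
    """g-formula contrast: sum_l {E[Y|A=1,L=l] - E[Y|A=0,L=l]} Pr(L=l)."""
    return sum((mean_y(joint, a=1, l=l) - mean_y(joint, a=0, l=l)) * prob(joint, l=l)
               for l in (0, 1))


joint = joint_distribution()
print("unadjusted:", unadjusted_contrast(joint))       # zero up to rounding error
print("adjusted for L:", standardized_contrast(joint))  # clearly non-null
```

Under these hypothetical numbers the unadjusted contrast recovers the null effect, whereas the contrast standardized by L does not, which is the pattern described above.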

Selection bias may also appear when adjusting for a noncollider affected 
by treatment, like the variable L in the causal diagram in Figure 18.3. Here 
the average causal effect $E[Y^{a=1}] - E[Y^{a=0}] \neq 0$ is also unbiasedly estimated 
by $E[Y|A = 1] - E[Y|A = 0]$ since there is no confounding by L. However, if 
we try to estimate the average causal effect by adjusting for L, the g-formula 
contrast will differ from $E[Y|A = 1] - E[Y|A = 0]$ for the same reasons as 
in the previous paragraph. Now suppose that the arrow from A to Y had 
been absent, that is, that the null hypothesis of no effect of A on Y were true 
and so $E[Y^{a=1}] - E[Y^{a=0}] = 0$. Then A and Y would be independent (both 
marginally and conditionally on L) and the g-formula contrast would be zero 
and thus unbiased. The key reason for this result is that, under the null, A 
no longer has a causal effect on L. That is, unlike in Figures 18.1 and 18.2, 
adjusting for L in Figure 18.3 results in selection bias only when A has a non- 
null causal effect on Y. We then say that there is selection bias under the 
alternative or off the null (see Section 6.5). 

When the adjustment variable is affected by the treatment A and affects 
the outcome Y, we say that the variable is a mediator. Consider the causal 
diagram in Figure 18.4, which includes the mediator L on a causal path from 
the treatment A to the outcome Y. The A-Y association adjusted for the 
mediator L, or its descendants, will differ from the effect of treatment A on 
the outcome Y because the adjustment blocks the component of the effect that 
goes through L. Sometimes this problem is referred to as overadjustment for 
mediators when the average causal effect of A on Y is the contrast of interest. 
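
To make the overadjustment concrete, consider a hypothetical linear version of the mediation structure with no confounding. The coefficients α, β, and γ and the error terms below are illustrative symbols that do not appear in the text, and the errors are assumed to have mean zero and to be independent of A and of each other. Under these assumptions the total effect combines the direct path and the path through L, whereas the L-adjusted contrast recovers only the component that does not go through L:

```latex
\begin{aligned}
L &= \alpha A + \varepsilon_L, \qquad Y = \beta A + \gamma L + \varepsilon_Y, \\
E[Y^{a=1}] - E[Y^{a=0}] &= \beta + \alpha\gamma \quad \text{(total effect)}, \\
E[Y \mid A = 1, L = l] - E[Y \mid A = 0, L = l] &= \beta \quad \text{(component not mediated through } L\text{)}.
\end{aligned}
```

For instance, with β = 1, α = 2, and γ = 0.5, the total effect is 2 but the L-adjusted contrast is 1.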

The bias-inducing variables discussed above share a common feature: they 
are affected by treatment and therefore they are post-treatment variables. One 
might then think that we should always avoid adjustment for variables that 
occur after treatment A. The rule of not adjusting for post-treatment variables 
would be easy to follow because the temporal sequence of the adjustment vari- 
ables and the treatment is usually known. Unfortunately, following this simple 
rule may result in the exclusion of useful adjustment variables, as we discussed 
in Fine Point 7.4. Consider the causal diagram in Figure 18.5. The variable 
L is a post-treatment variable, but it can be used to block the backdoor path 
between treatment A and outcome Y. Therefore, the A-Y association adjusted 
for L is an unbiased estimator of the effect of A on Y, whereas the unadjusted 
A-Y association is a biased estimator. The take-home message is that causal 
graphs do not care about temporal order. Thus, when A does not affect L, the 
analysis must be the same whether L is temporally before or temporally after 
A. 

The problem is that, even when we know the temporal order of the vari- 
ables, we cannot determine from the data whether or not A affects L. In fact, 
Bias amplification is guaranteed if 
all the equations in the structural 
equation model corresponding to 
the causal diagram are linear (Pearl 
2011), but may also occur in more 
realistic settings (Ding et al. 2017). 


An example of the application of 
expert knowledge to adjustment was 
described by Hernán et al. (2002). 


given the temporal ordering A, L, Y, any joint distribution of (A, L, Y) without 
any independencies is compatible with several causal graphs. So the decision 
whether to adjust for L must be based on information outside of the data. 
That is, whether to adjust for L cannot be determined via any automated 
procedures that rely exclusively on statistical associations. For example, as 
discussed in Chapter 7, there is no way to distinguish a collider from a con- 
founder by using data only. Rather, the exclusion of bias-inducing variables 
from the adjustment set needs to be guided by subject-matter knowledge (if it 
exists) about the causal structure of the problem. 

We next turn to the question of adjustment for variables L that are temporally 
prior to treatment A, that is, our temporal ordering is now L, A, Y. 
Suppose, for simplicity, that the sample size is very large, greatly exceeding 
the number of covariates X available for adjustment. As a consequence, the 
variance of any estimator will be negligible and the only issue is bias. In this 
setting it is commonly believed that an estimator that adjusts for all available 
pretreatment covariates will minimize the bias. However, this belief is wrong 
for two separate reasons. 

Consider the causal diagram of Figure 18.6 (same as Figure 7.4), which 
includes a pre-treatment variable L. Because L is a collider on a path from 
A to Y, adjusting for it will introduce selection bias, which we referred to as 
M-bias in Chapter 7. Again, the observed data cannot distinguish between 
confounders and colliders, so one must rely on whatever external information 
one may have to decide whether or not to adjust for a pre-treatment variable L. 
In fact, it is also possible that L is both a confounder and a collider—if there 
were an arrow from L to A in Figure 18.6—which implies that the average 
causal effect cannot be identified, regardless of whether we do or do not adjust 
for L. 

There is one additional reason to avoid indiscriminate adjustment for pre- 
treatment variables: bias amplification, a phenomenon we have not yet de- 
scribed in this book. Consider the causal diagram of Figure 18.7 (same as 
Figure 16.1), which represents a setting in which the causal effect of treatment 
A on the outcome Y is confounded by the unmeasured variable U. Because 
U is not available in the data, we cannot adjust for U and the confounding 
is intractable. Adjustment for the variable Z—using the g-formula as above 
with L replaced by Z—does not eliminate confounding because Z is not on 
any backdoor path from the treatment A to the outcome Y. In fact, Z is an 
instrument—which can be used for instrumental variable estimation in some 
situations described in Chapter 16—and therefore useless for direct confound- 
ing adjustment by the g-formula. 

Interestingly, even though Z cannot be used to adjust away the confounding 
bias due to U, adjustment for the instrument Z can amplify the confounding 
bias due to U. That is, the A-Y association adjusted for Z may be further from 
the effect of A on Y than the A-Y association not adjusted for Z. This bias 
amplification due to adjusting for an instrument Z, often referred to as Z-bias, 
is a reason to avoid adjustment for variables that, like Z, are instruments. Bias 
amplification, however, is not guaranteed: adjustment for Z could also reduce 
the bias due to confounding by the unmeasured variable U. Generally, it is not 
possible to know whether adjustment for an instrument will amplify or reduce 
bias. 
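
In the special case of a linear structural equation model, the amplification can be computed in closed form. The sketch below assumes Z and U are independent with unit variance, A = a_z·Z + a_u·U + ε_A, and Y = θ·A + y_u·U + ε_Y. The coefficient names and the function are hypothetical, and the formula describes the large-sample bias of the least-squares coefficient of A. It should not be read as a statement about nonlinear settings, where adjustment for Z may also reduce bias.

```python
def confounding_bias(a_z, a_u, y_u, var_eps_a=1.0, adjust_for_z=False):
    """Large-sample bias of the least-squares coefficient of A for the effect of A on Y.

    Assumed linear model with independent, unit-variance Z and U:
        A = a_z*Z + a_u*U + eps_A,    Y = theta*A + y_u*U + eps_Y
    The bias equals y_u * Cov(A, U) / Var(A), with the variance of A taken
    conditional on Z when Z is included in the regression.
    """
    cov_a_u = a_u                 # Cov(A, U); unchanged by conditioning on Z
    var_a = a_u ** 2 + var_eps_a  # Var(A | Z): variation in A not due to Z
    if not adjust_for_z:
        var_a += a_z ** 2         # Var(A) also includes the variation due to Z
    return y_u * cov_a_u / var_a


# Hypothetical coefficients: a strong instrument and a moderate confounder.
unadjusted = confounding_bias(a_z=2.0, a_u=1.0, y_u=1.0)
adjusted = confounding_bias(a_z=2.0, a_u=1.0, y_u=1.0, adjust_for_z=True)
print(unadjusted, adjusted)  # the bias is larger after adjusting for Z
```

Adjusting for Z removes the part of the variation in A that is free of confounding, so the same covariance with U is divided by a smaller variance. The stronger the instrument, the larger the amplification in this linear setting.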

In summary, even if we had no computational constraints and a quasi-infinite 
sample size, it is not advisable to adjust for all available variables X. 
Ideally, the adjustment set would not include any variables that may introduce 
or amplify bias. Because these bias-inducing variables cannot be empirically 
identified by purely statistical algorithms, expert knowledge is needed to guide 
variable selection. 


18.3 Causal inference and machine learning 


The next three sections reflect 
Miguel Hernán’s informal interpretation 
(as of April 2019) of a set 
of theoretical lectures that James 
Robins delivered in Boston, Berlin, 
Rotterdam, and elsewhere between 
2018 and early 2019. 


Remember that some of the variables 
in X may not even be confounders 
so we do not need to adjust 
for them. 


Machine learning algorithms can 
use cross-validation (see Fine Point 
18.2) to optimize predictive accuracy. 


Multiple authors have studied the 
problems of ad hoc or automatic 
variable selection for causal inference. 
See Greenland (2008) for a 
list of citations. 


For the remainder of this chapter, we will assume that we have somehow suc- 
ceeded at ensuring that X includes no variables that may induce or amplify 
bias (i.e., no variables that would destroy conditional exchangeability if 
adjusted for) while still including all confounders L of the average causal effect of 
A on Y (i.e., all variables needed to achieve conditional exchangeability). Our 
next problem is to estimate this effect $E[Y^{a=1}] - E[Y^{a=0}]$ in practice when 
X is very high-dimensional. 

Depending on the adjustment method that we choose, the variables X will 
be used in different ways. When using the plug-in g-formula (standardization), 
we will estimate the mean outcome Y conditional on the variables X, which 
we refer to as b(X); when using IP weighting, we will estimate the probability 
of treatment A conditional on the variables X, which we refer to as $\pi(X)$. We 
can produce estimates $\hat{b}(x)$ and $\hat{\pi}(x)$ via the sort of parametric models (e.g., 
linear and logistic regression) that we have described in Part II of this book. To 
reduce the possibility of model misspecification, we will want to fit richly pa- 
rameterized models with multiple product terms and flexible functional forms 
(e.g., cubic splines) for the variables in X. 

The use of traditional parametric models encounters a serious constraint 
in many practical applications. The number of parameters in those models 
will be very large compared with, or larger than, the sample size n. However, 
traditional parametric models can only have a small number of parameters 
compared with the sample size. Therefore, fitting richly parameterized models 
with terms for all variables in X will yield very imprecise $\hat{b}(x)$ and $\hat{\pi}(x)$ 
estimates, or may actually be impossible under the usual large sample 
approximations used to fit linear or logistic models when the number of parameters in 
the model is greater than the number of individuals n in the dataset. Also, as 
discussed in Section 15.5, X may include non-confounders that are strongly 
associated with the treatment A, which will result in quasi-violations of positivity 
when using methods that require fitting a model for $\pi(X)$. 

Possible ways forward are to fit the parametric models using the lasso (see 
Fine Point 18.1), a variable selection algorithm originally designed for 
predictive regression models, or to estimate the conditional expectations $b(X)$ and 
$\pi(X)$ using predictive machine learning algorithms such as tree-based 
algorithms (e.g., random forests) or neural networks (e.g., deep learning). In very 
high-dimensional databases, these and other machine learning algorithms can 
effectively incorporate thousands of parameters and thus outperform 
traditional parametric models for the accurate prediction of conditional 
expectations. But the use of predictive machine learning algorithms to estimate $b(X)$ 
and $\pi(X)$ raises two serious concerns. 

First, machine learning algorithms do not guarantee that the selected variables 
will eliminate confounding when using methods that require estimates 
of either the conditional mean outcome $b(X)$ (like the plug-in g-formula) or 
the probability of treatment $\pi(X)$ (like IP weighting). An improved approach 
is to use a doubly robust estimator that appropriately combines estimates of 
both $b(X)$ and $\pi(X)$. Doubly robust estimators may succeed because their 
bias, unlike that of non-doubly robust estimators, depends on the product of 




This property of doubly robust esti- 
mators is referred to as a second- 
order bias. See Technical Point 
13.2 for details. 


The degree of undercoverage will 
be greater when there is some de- 
gree of confounding in the super- 
population since, in that case, Wald 
confidence intervals will not be cen- 
tered on an unbiased estimator of 
the causal effect (see Chapter 10). 
the error $1/\pi(x) - 1/\hat{\pi}(x)$ in the estimation of $1/\pi(x)$ with the error $b(x) - \hat{b}(x)$ in the 
estimation of $b(x)$. Therefore, doubly robust estimators may have small bias 
when they are based on accurate estimates $\hat{b}(x)$ and $\hat{\pi}(x)$ obtained via machine 
learning estimators. 
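
One common form of a doubly robust estimator is the augmented IP weighted estimator, sketched below. The text does not say which form it has in mind here, so this is an illustration rather than the book's own estimator. The function name, the data layout, and the convention that the outcome regression takes both treatment and covariates as arguments are all assumptions.

```python
def aipw_effect(data, b_hat, pi_hat):
    """Augmented IP weighted (doubly robust) estimate of E[Y^{a=1}] - E[Y^{a=0}].

    data   : list of (x, a, y) tuples with binary treatment a
    b_hat  : function (a, x) -> estimated E[Y | A=a, X=x]
    pi_hat : function x -> estimated Pr[A=1 | X=x]
    """
    terms = []
    for x, a, y in data:
        p = pi_hat(x)
        b1, b0 = b_hat(1, x), b_hat(0, x)
        # Outcome-model prediction plus an IP weighted correction of its residual.
        term1 = b1 + a * (y - b1) / p
        term0 = b0 + (1 - a) * (y - b0) / (1 - p)
        terms.append(term1 - term0)
    return sum(terms) / len(terms)


# Tiny hypothetical dataset: (x, a, y).
toy = [(0, 1, 3.0), (0, 0, 1.0), (1, 1, 5.0), (1, 0, 2.0)]
stratum_means = {(1, 0): 3.0, (0, 0): 1.0, (1, 1): 5.0, (0, 1): 2.0}

# Correct outcome model with a wrong treatment model ...
effect_b_right = aipw_effect(toy, lambda a, x: stratum_means[(a, x)], lambda x: 0.2)
# ... and a wrong outcome model with the correct treatment model.
effect_pi_right = aipw_effect(toy, lambda a, x: 0.0, lambda x: 0.5)
print(effect_b_right, effect_pi_right)
```

In this toy dataset either nuisance function can be badly wrong without changing the estimate as long as the other one is right, which is the sense in which the estimator is doubly robust.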

Second, machine learning algorithms are statistical black boxes with largely 
unknown statistical properties. That is, even if a doubly robust estimator is 
unbiased, the variance of the resulting estimate may be wrong. As a result, 
the calculated confidence intervals will lose their frequentist interpretation. 
Specifically, it is expected that 95% confidence intervals obtained via predictive 
machine learning will be too narrow and thus invalid: they fail to trap the 
causal parameter of interest at least 95% of the time. 

Thus, the use of doubly robust estimators is key to combining causal infer- 
ence with machine learning, but it is not sufficient. The next section describes 
two modifications to doubly robust estimation that tackle this second problem: 
sample splitting and cross-fitting. 


18.4 Doubly robust machine learning estimators 


We may refer to the training sample 
as the nuisance sample because we 
use it to estimate the nuisance 
regressions for $b(X)$ and $\pi(X)$. Fine 
Point 15.1 reviews the concept of 
nuisance parameters. 


Let us suppose that the use of predictive machine learning algorithms results 
in a small bias for a doubly robust estimator. Small bias means that the bias of 
the estimate is much smaller than its standard error. More precisely, the bias 
has to be less than $1/\sqrt{n}$. A small bias is easier to achieve with doubly robust 
estimators than with non-doubly robust estimators because, again, the bias of doubly 
robust estimators is the product of the errors $1/\pi(x) - 1/\hat{\pi}(x)$ and $b(x) - \hat{b}(x)$. 
That is, even if each of the conditional expectations is estimated with an 
error larger than $1/\sqrt{n}$, the bias of the doubly robust estimator may still be 
sufficiently small for the construction of valid confidence intervals. 
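
As a rough arithmetic illustration, suppose each of the two errors shrinks like $n^{-1/3}$, which is slower than $1/\sqrt{n}$. The rate $n^{-1/3}$ is a hypothetical choice made for the example, not a claim about any particular algorithm.

```latex
\underbrace{\left|\frac{1}{\pi(x)} - \frac{1}{\hat{\pi}(x)}\right|}_{\approx\, n^{-1/3}}
\times
\underbrace{\left|b(x) - \hat{b}(x)\right|}_{\approx\, n^{-1/3}}
\;\approx\; n^{-2/3} \;<\; n^{-1/2} = \frac{1}{\sqrt{n}}.
```

With n = 10,000, each error is about 0.046, larger than $1/\sqrt{n}$ = 0.01, yet their product is about 0.002, which is smaller than 0.01.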

Then, in large samples (i.e., asymptotically) and under some weak condi- 
tions, we can define a consistent doubly robust estimator that will follow a 
normal distribution with mean at the true value of the causal parameter. That 
is, we will be able to construct valid confidence intervals for the doubly robust 
estimator with small bias. For this to be true, the doubly robust estimator 
needs to incorporate two procedures, sample splitting and cross-fitting, which 
we describe below. 

We first describe sample splitting. First, we randomly divide the study pop- 
ulation of n individuals into two halves: an estimation sample and a training 
sample, each with n/2 individuals. Second, we apply the predictive algorithms 
to the training sample in order to obtain estimators $\hat{b}(x)$ and $\hat{\pi}(x)$ of the 
conditional expectations of outcome and treatment, respectively. Third, we 
compute the doubly robust estimator of the average causal effect in the 
estimation sample using the estimators $\hat{b}(x)$ and $\hat{\pi}(x)$ from the training sample. 
We have now obtained a doubly robust machine learning estimate of the aver- 
age causal effect in a random half of the study population. 

Sample splitting allows us to use standard statistical inference procedures 
based on the n/2 individuals in the estimation sample. Being able to construct 
a valid confidence interval is a good thing but, unfortunately, we lose half of the 
sample size in the process. As a result, our confidence interval will be wider 
than the one we would have obtained if we had been able to use the entire 
sample of n individuals. A way to overcome this problem is cross-fitting. 

We now describe how cross-fitting recovers the statistical efficiency lost by 
Sample splitting and cross-fitting 
are not new procedures. However, 
the idea of combining these 
procedures with machine learning has not 
been emphasized until recently. 




sample splitting. First, we repeat the above procedure but swapping the roles 
of the estimation and training halves of the study population. That is, we 
use the half formerly reserved for estimation as the new training sample, and 
the half formerly used for training as the new estimation sample. We then 
compute the doubly robust estimator of the average causal effect in the new 
estimation sample using the estimators $\hat{b}(x)$ and $\hat{\pi}(x)$ from the new training 
sample. We have now obtained a doubly robust machine learning estimate of 
the average causal effect in the other random half of the population. 

The next step is to compute the average of the two doubly robust estimates 
from each half of the population. This average will be our doubly robust 
estimate of the effect in the entire study population. A 95% confidence interval 
around this estimate can be constructed by bootstrapping, either by adding 
and subtracting 1.96 times the bootstrap standard error or by using the 2.5 
and 97.5 percentiles of the bootstrap estimates as the bounds of the interval. 
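
The sample splitting and cross-fitting steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation. The "learner" is a toy stand-in (stratum-specific sample means for a single discrete covariate) for whatever predictive algorithm would be used in practice, the doubly robust estimator is the augmented IP weighted form, and the simulated data, function names, and true effect of 2 are hypothetical. The bootstrap confidence interval is omitted for brevity.

```python
import random


def fit_nuisances(training):
    """Toy 'learner': stratum-specific means, standing in for a predictive algorithm."""
    cells = {}  # (a, x) -> [sum of y, number of individuals]
    for x, a, y in training:
        cell = cells.setdefault((a, x), [0.0, 0])
        cell[0] += y
        cell[1] += 1

    def b_hat(a, x):
        total, count = cells[(a, x)]
        return total / count

    def pi_hat(x):
        treated, untreated = cells[(1, x)][1], cells[(0, x)][1]
        return treated / (treated + untreated)

    return b_hat, pi_hat


def aipw_effect(sample, b_hat, pi_hat):
    """Doubly robust (augmented IP weighted) estimate in the estimation sample."""
    total = 0.0
    for x, a, y in sample:
        p, b1, b0 = pi_hat(x), b_hat(1, x), b_hat(0, x)
        total += (b1 + a * (y - b1) / p) - (b0 + (1 - a) * (y - b0) / (1 - p))
    return total / len(sample)


def cross_fit_effect(data, seed=0):
    """Split in two halves, estimate in each half with nuisances from the other, average."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    estimates = []
    for training, estimation in ((first, second), (second, first)):
        b_hat, pi_hat = fit_nuisances(training)
        estimates.append(aipw_effect(estimation, b_hat, pi_hat))
    return sum(estimates) / 2


def simulate(n, seed=1):
    """Hypothetical data: x confounds the effect of a on y; the true effect is 2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.choice((0, 1, 2))
        a = 1 if rng.random() < 0.3 + 0.2 * x else 0
        y = 2.0 * a + 1.5 * x + rng.gauss(0, 1)
        data.append((x, a, y))
    return data


estimate = cross_fit_effect(simulate(4000))
print(estimate)  # expected to be close to the true effect of 2
```

The key design choice is that the nuisance functions applied to an individual are never fitted on that individual's own data. Swapping the roles of the two halves and averaging is what lets the final estimate use all n individuals.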

We are done. Through sample splitting and cross-fitting, we can combine 
doubly robust estimation and machine learning to obtain causal effect estimates 
which have known statistical properties and which use all the available data. 
An active area of research is the development of procedures to detect whether 
the bias of doubly robust estimators is too large and, if so, to obtain estimates 
with smaller bias in the estimation sample without having to redo the machine 
learning component in the training sample. 


18.5 Variable selection is a difficult problem 


The methods outlined in the previous section invalidate the widespread belief 
that any data-based procedure to select adjustment variables will inevitably 
result in incorrect confidence intervals. As we have seen, the combination 
of causal inference methods with machine learning algorithms for confounder 
selection can, under certain conditions, result in correct statistical inference. 
However, doubly robust machine learning does not solve all our problems for 
at least four reasons. 

First, in many applications, the available subject-matter knowledge may be 
insufficient to identify all important confounders or to rule out variables that 
induce or amplify bias. Thus there is no guarantee that doubly robust machine 
learning estimators will have a small bias. 

Second, many machine learning algorithms are available but no algorithm 
is optimal in all settings. No mathematical theorem can show that one algo- 
rithm is generally better than another. The choice of algorithm should depend 
on the causal structure that gave rise to the data, but such causal structure 
may be unknown or hard to summarize for the development of practical rec- 
ommendations. 

Third, the implementation of doubly robust estimators is difficult—and 
computationally expensive when combined with machine learning—in high- 
dimensional settings with time-varying treatments. This is especially true for 
causal survival analysis. As a result, most published examples of causal infer- 
ence from complex longitudinal data use single robust estimators, which are 
the ones emphasized in Part III of this book. 

Fourth, doubly robust machine learning can yield a variance of the causal 
effect that equals the variance that would have been obtained if the true 
conditional expectations $b(X)$ and $\pi(X)$ were known. However, there is no 
guarantee that such variance will be small enough for meaningful causal inference. 




This result raises a puzzling philo- 
sophical question: If the confidence 
interval is invalid when we use the 
data to rule out, say, 5 variables 
that make the variance too large, 
then why should the confidence in- 
terval be valid if we had happened 
to receive a dataset that did not in- 
clude those 5 variables? Given that 
we always work with datasets in 
which some potential confounders 
are not recorded, how should we in- 
terpret confidence intervals in any 
observational analysis? 
Suppose that we obtain a doubly robust machine learning estimate of the 
causal effect, as described in the previous section, only to find out that its 
(correct) variance is too big to be useful. This will happen, even when we have 
done everything correctly, if some of the covariates in X are strongly associated 
with the treatment A. Then the estimate of the probability of treatment $\pi(X)$ 
may be near 0 or near 1 for individuals with a particular value of X. As a 
result, the effect estimate will have a very large variance and thus a very wide 
(but correct) 95% confidence interval. Since we do not like very wide 95% 
confidence intervals, even if they are correct, we may be tempted to throw out 
the variables in X that are causing the “problem” and then repeat the data 
analysis. If we did that, we would be fundamentally changing the game. Using 
the data to discard covariates in X that are associated with treatment, but 
not so much with the outcome, makes it no longer possible to guarantee that 
the 95% confidence interval around the effect estimate is valid. The tension 
between including all potential confounders to eliminate bias and excluding 
some variables to reduce the variance is hard to resolve. 
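
To see why estimated treatment probabilities near 0 or 1 inflate the variance, it helps to look at the inverse probability terms that appear in IP weighted and augmented IP weighted (doubly robust) estimators. The sketch below is an illustration under that assumption; the function name `ip_weight` is hypothetical and the probabilities are made-up values, not estimates from any dataset.

```python
def ip_weight(a, pi):
    """Inverse probability weight for treatment value a, where pi = Pr[A = 1 | X]."""
    return 1.0 / pi if a == 1 else 1.0 / (1.0 - pi)

# A treated individual whose covariates make treatment very unlikely
# receives an enormous weight and can dominate the estimate.
for pi in (0.5, 0.1, 0.01, 0.001):
    print(f"pi(X) = {pi}: weight for a treated individual = {ip_weight(1, pi):.0f}")
```

A handful of individuals with such extreme weights is enough to make the estimate unstable across samples, which is what a wide (but correct) confidence interval reflects.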

Given all of the above, developing a clear set of general guidelines for vari- 
able selection may not be possible. In fact, so much methodological research is 
ongoing around these issues that this chapter cannot possibly be prescriptive. 
The best scientific advice for causal inference may be to carry out multiple sen- 
sitivity analyses: implement several analytic methods and inspect the resulting 
effect estimates. If the various effect estimates are compatible, we will be more 
confident in the results. If the various effect estimates are not compatible, our 
job as researchers is to try to understand why. 
Part III 


Causal inference from complex longitudinal data 


Chapter 19 
TIME-VARYING TREATMENTS 


So far this book has dealt with fixed treatments which do not vary over time. However, many causal questions 
involve treatments that vary over time. For example, we may be interested in estimating the causal effects of 
medical treatments, lifestyle habits, employment status, marital status, occupational exposures, etc. Because 
these treatments may take different values for a single individual over time, we refer to them as time-varying 
treatments. 

Restricting our attention to time-fixed treatments during Parts I and II of this book helped us introduce 
basic concepts and methods. It is now time to consider more realistic causal questions that involve the contrast 
of hypothetical interventions that are played out over time. Part III extends the material in Parts I and II to 
time-varying treatments. This chapter describes some key terminology and concepts for causal inference with 
time-varying treatments. Though we have done our best to simplify those concepts (if you don’t believe us, check 
out the causal inference literature), this is still one of the most technical chapters in the book. Unfortunately, 
further simplification would result in too much loss of rigor. But if you made it this far, you are qualified to 
understand this chapter. 


19.1 The causal effect of time-varying treatments 


Consider a time-fixed treatment variable A (1: treated, 0: untreated) at time 
zero of follow-up and an outcome variable Y measured 60 months later. We 
have previously defined the average causal effect of A on the outcome Y as the 
contrast between the mean counterfactual outcome $Y^{a=1}$ under treatment and 
the mean counterfactual outcome $Y^{a=0}$ under no treatment, that is, 
$E[Y^{a=1}] - E[Y^{a=0}]$. Because treatment status is determined at a single 
time (time zero) for everybody, the average causal effect does not need to make 
reference to the time at which treatment occurs. In contrast, causal contrasts 
that involve time-varying treatments need to incorporate time explicitly. 
To see this, consider a time-varying dichotomous treatment $A_k$ that may 
change at every month k of follow-up, where $k = 0, 1, 2, \ldots, K$ with K = 59. 
For example, in a 5-year follow-up study of individuals infected with the human 
immunodeficiency virus (HIV), $A_k$ takes value 1 if the individual receives 
antiretroviral therapy in month k, and 0 otherwise. No individuals received 
treatment before the start of the study at time 0, i.e., $A_{-1} = 0$ for all 
individuals. 

We use an overbar to denote treatment history, that is, 
$\bar{A}_k = (A_0, A_1, \ldots, A_k)$ is the history of treatment from time 0 to 
time k. When we refer to the entire treatment history through K, we often 
represent $\bar{A}_K$ as $\bar{A}$ without a subscript. In our HIV study, an 
individual who receives treatment continuously throughout the follow-up has 
treatment history $\bar{A} = (A_0 = 1, A_1 = 1, \ldots, A_{59} = 1) = (1, 1, \ldots, 1)$, 
or $\bar{A} = \bar{1}$. Analogously, an individual who never receives treatment 
during the follow-up has treatment history $\bar{A} = (0, 0, \ldots, 0) = \bar{0}$. 
Most individuals are treated during part of the follow-up only, and therefore have 
intermediate treatment histories with some 1s and some 0s, which we cannot 
represent as compactly as $\bar{1}$ and $\bar{0}$. 

For simplicity, we will provisionally assume that no individuals were lost 
to follow-up or died during this period, and we will also assume that all 
variables were perfectly measured. 

For compatibility with many published papers, we use zero-based indexing for 
time. That is, the first time of possible treatment is k = 0 rather than k = 1. 
To keep things simple, our example considers an outcome measured at a fixed 
time. However, the concepts discussed in this chapter also apply to time-varying 
outcomes and failure time outcomes. 

Remember: we use lower-case to denote possible realizations of a random 
variable, e.g., $a_k$ is a realization of treatment $A_k$. 

Suppose Y measures health status (with higher values of Y indicating 
better health) at the end of follow-up at time K + 1 = 60. We would 
like to estimate the average causal effect of the time-varying treatment A on 
the outcome Y. But we can no longer define the average causal effect of a 
time-varying treatment as a contrast at a single time k, because the contrast 
$E[Y^{a_k=1}] - E[Y^{a_k=0}]$ quantifies the effect of treatment $A_k$ at a 
single time k, not the effect of the time-varying treatment $A_k$ at all times k 
between 0 and 59. 

Indeed we will have to define the average causal effect as a contrast between 
the counterfactual mean outcomes under two treatment strategies that involve 
treatment at all times between the start (k = 0) and the end (k = K) of 
the follow-up. As a consequence, the average causal effect of a time-varying 
treatment is not uniquely defined. In the next section, we describe many 
possible definitions of average causal effect for a time-varying treatment. 


19.2 Treatment strategies 


A general counterfactual theory to compare treatment strategies was first 
articulated by Robins (1986, 1987, 1997a). 


A treatment strategy (also referred to as a plan, policy, protocol, or regime) 
is a rule to assign treatment at each time k of follow-up. For example, two 
treatment strategies are “always treat” and “never treat” during the follow-up. 
The strategy “always treat” is represented by 
$\bar{a} = (1, 1, \ldots, 1) = \bar{1}$, and the strategy “never treat” is 
represented by $\bar{a} = (0, 0, \ldots, 0) = \bar{0}$. We can now define 
an average causal effect of A on the outcome Y as the contrast between the 
mean counterfactual outcome $Y^{\bar{a}=\bar{1}}$ under the strategy “always 
treat” and the mean counterfactual outcome $Y^{\bar{a}=\bar{0}}$ under the 
strategy “never treat”, that is, 
$E[Y^{\bar{a}=\bar{1}}] - E[Y^{\bar{a}=\bar{0}}]$. 

But there are many other possible causal effects for the time-varying 
treatment A, each of them defined by a contrast of outcomes under two particular 
treatment strategies. For example, we might be interested in the average causal 
effect defined by the contrast $E[Y^{\bar{a}}] - E[Y^{\bar{a}'}]$ that compares 
the strategy “treat at every other month” $\bar{a} = (1, 0, 1, 0, \ldots)$ with 
the strategy “treat at all months except the first one” 
$\bar{a}' = (0, 1, 1, 1, \ldots)$. The number of possible contrasts is 
very large: we can define at least $2^K$ treatment strategies because there are 
$2^K$ possible combinations of values $(a_0, a_1, \ldots, a_K)$ for a dichotomous 
$a_k$. In fact, as we next explain, these $2^K$ strategies do not exhaust all 
possible treatment strategies. 
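
The static strategies discussed above can be enumerated directly for a small number of time points. The sketch below is illustrative only: the function name `static_strategies` and the choice K = 2 are assumptions made to keep the output readable. For time points k = 0, ..., K the enumeration produces one binary sequence per combination of treatment values, so the count grows exponentially with the number of time points.

```python
from itertools import product

def static_strategies(K):
    """All static strategies (a_0, ..., a_K) for a dichotomous treatment."""
    return list(product((0, 1), repeat=K + 1))

K = 2  # three time points k = 0, 1, 2 (the HIV example has K = 59)
strategies = static_strategies(K)

always_treat = (1,) * (K + 1)
never_treat = (0,) * (K + 1)
every_other_month = tuple(1 if k % 2 == 0 else 0 for k in range(K + 1))
all_but_first = (0,) + (1,) * K

print(len(strategies))  # 8 sequences for K = 2
print(every_other_month, all_but_first)
```

Any pair of these sequences defines a causal contrast, which is why the average causal effect of a time-varying treatment must name the two strategies being compared.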

To define even more treatment strategies in our HIV example, consider the 
time-varying covariate $L_k$ which denotes CD4 cell count (in cells/μL) measured 
at month k in all individuals. The variable $L_k$ takes value 1 when the CD4 cell 
count is low, which indicates a bad prognosis, and 0 otherwise. At time zero, 
all individuals have a high CD4 cell count, $L_0 = 0$. We could then consider the 
strategy “do not treat while $L_k = 0$, start treatment when $L_k = 1$ and treat 
continuously after that time”. This treatment strategy is different from the 
ones considered in the previous paragraph because we cannot represent it by 
a rule $\bar{a} = (a_0, a_1, \ldots, a_K)$ under which all individuals get the 
same treatment $a_0$ at time k = 0, $a_1$ at time k = 1, etc. Now, at each time, 
some individuals will be treated and others will be untreated, depending on the 
value of their evolving $L_k$. This is an example of a dynamic treatment strategy, 
a rule in which the treatment $a_k$ at time k depends on the evolution of an 
individual’s time-varying covariate(s) $\bar{L}_k$. Strategies $\bar{a}$ for 
which treatment does not depend on covariates are non-dynamic or static 
treatment strategies. See Fine Point 19.1 for a formal definition. 


Fine Point 19.1 


Deterministic and random treatment strategies. A dynamic treatment strategy is a rule 
$g = [g_0(\bar{a}_{-1}, l_0), \ldots, g_K(\bar{a}_{K-1}, \bar{l}_K)]$, where 
$g_k(\bar{a}_{k-1}, \bar{l}_k)$ specifies the treatment assigned at k to an individual with past 
history $(\bar{a}_{k-1}, \bar{l}_k)$. An example in our HIV study: $g_k(\bar{a}_{k-1}, \bar{l}_k)$ 
is 1 if an individual's CD4 cell count (a function of $\bar{l}_k$) was low at or before k; otherwise 
$g_k(\bar{a}_{k-1}, \bar{l}_k)$ is 0. A static treatment strategy is a rule 
$g = [g_0(\bar{a}_{-1}), \ldots, g_K(\bar{a}_{K-1})]$, where $g_k(\bar{a}_{k-1})$ does not depend on 
$\bar{l}_k$. We will often abbreviate $g_k(\bar{a}_{k-1}, \bar{l}_k)$ as $g(\bar{a}_{k-1}, \bar{l}_k)$. 

Most static and dynamic strategies we are interested in comparing are deterministic treatment strategies, which 
assign a particular value of treatment (0 or 1) to each individual at each time. More generally, we could consider random 
treatment strategies that do not assign a particular value of treatment, but rather a probability of receiving a treatment 
value. Random treatment strategies can be static (e.g., “independently at each month, treat individuals with probability 
0.3 and do not treat with probability 0.7”) or dynamic (e.g., “independently at each month, treat individuals whose 
CD4 cell count is low with probability 0.3, but do not treat individuals with high CD4 cell count”). 

We refer to the strategy g for which the mean counterfactual outcome $E[Y^g]$ is maximized (when higher values 
of outcome are better) as the optimal treatment strategy. For a drug treatment, the optimal strategy will almost 
always be dynamic because treatment needs to be discontinued when toxicity develops. Also, no random strategy can 
ever be preferred to the optimal deterministic strategy. However, random strategies (i.e., ordinary randomized trials and 
sequentially randomized trials) remain scientifically necessary because, before the trial, it is unknown which deterministic 
regime is optimal. See Young et al. (2014) for a taxonomy of treatment strategies. In the text, except if noted otherwise, 
the letter g will refer only to deterministic treatment strategies. 
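
A deterministic strategy can be viewed as a function of past treatment and covariate history, which makes the static versus dynamic distinction concrete. The following sketch is an assumption-laden illustration: the function names are hypothetical, and the covariate path is taken as given, whereas under a real intervention later covariate values could themselves be affected by earlier treatment.

```python
def g_start_when_low(prev_treatments, l_history):
    """Dynamic rule: treat (1) if CD4 was low (L = 1) at or before time k, else 0."""
    return 1 if any(l == 1 for l in l_history) else 0

def g_always_treat(prev_treatments, l_history):
    """Static rule: the assigned treatment ignores covariate history."""
    return 1

def follow_strategy(g, l_path):
    """Treatment history implied by rule g along a given covariate path."""
    a_hist = []
    for k in range(len(l_path)):
        # g receives treatment history through k-1 and covariate history through k.
        a_hist.append(g(tuple(a_hist), tuple(l_path[: k + 1])))
    return tuple(a_hist)

# CD4 becomes low (L = 1) at month 2 and recovers at month 3.
print(follow_strategy(g_start_when_low, [0, 0, 1, 0]))  # (0, 0, 1, 1)
print(follow_strategy(g_always_treat, [0, 0, 1, 0]))    # (1, 1, 1, 1)
```

Two individuals with different covariate paths receive different treatment sequences under the dynamic rule but the same sequence under the static rule.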

Causal inference with time-varying treatments involves the contrast of 
counterfactual outcomes under two or more treatment strategies. The average 
causal effect of a time-varying treatment is only well-defined if the treatment 
strategies of interest are specified. In our HIV example, we can define an 
average causal effect based on the difference $E[Y^{\bar{a}}] - E[Y^{\bar{a}'}]$ 
that contrasts strategy $\bar{a}$ (say, “always treat”) versus strategy 
$\bar{a}'$ (say, “never treat”), or on the difference 
$E[Y^{\bar{a}}] - E[Y^g]$ that contrasts strategy $\bar{a}$ (“always treat”) 
versus strategy g (say, “treat only after CD4 cell count is low”). Note we will 
often use g to represent any strategy, static or dynamic. When we use it to 
represent a static strategy, we sometimes write $Y^{g=\bar{a}}$ rather than just 
$Y^g$ or $Y^{\bar{a}}$. 

That is, there is not a single definition of causal effect for time-varying 
treatments. Even when only two treatment options—treat or do not treat— 
exist at each time k, we can still define as many causal effects as pairs of 
treatment strategies exist. In the next section, we describe a study design 
under which all these causal effects can be validly estimated: the sequentially 
randomized experiment. 


19.3 Sequentially randomized experiments 


The causal diagrams in Figures 19.1, 19.2, and 19.3 summarize three situations 
that can occur in studies with time-varying treatments. In all three diagrams, 
$A_k$ represents the time-varying treatment, $L_k$ the set of measured variables, 
Y the outcome, and $U_k$ the set of unmeasured variables at k that are common 
causes of at least two other variables on the causal graph. Because the 
covariates $U_k$ are not measured, their values are unknown and therefore 
unavailable for the analysis. In our HIV study, the time-varying covariate CD4 
cell count $L_k$ is a consequence of the true, but unmeasured, chronic damage to 
the immune system $U_k$. The greater an individual’s immune damage $U_k$, the 
lower her CD4 cell count $L_k$ and her health status Y. For simplicity, the causal 
diagrams include only the first two times of follow-up k = 0 and k = 1, and we 
will assume that all participants adhered to the assigned treatment (see Fine 
Point 19.2). 

Recall that, by definition, a causal graph must always include all common 
causes of any two variables on the graph. 


Technical Point 19.1 


On the definition of dynamic strategies. Each dynamic strategy 
$g = [g_0(\bar{a}_{-1}, l_0), \ldots, g_K(\bar{a}_{K-1}, \bar{l}_K)]$ that depends 
on past treatment and covariate history is associated with a dynamic strategy 
$g' = [g'_0(l_0), \ldots, g'_K(\bar{l}_K)]$ that depends 
only on past covariate history. By consistency (see Technical Point 19.2), an individual will have the same treatment, 
covariate, and outcome history when following strategy g from time zero as when following strategy $g'$ from time 
zero. In particular, $Y^g = Y^{g'}$ and $\bar{L}^g(K) = \bar{L}^{g'}(K)$. Specifically, $g'$ is defined in terms of g 
recursively by $g'_0(l_0) = g_0(\bar{a}_{-1} = 0, l_0)$ (by convention, $\bar{a}_{-1}$ can only take the value zero) and 
$g'_k(\bar{l}_k) = g_k[\bar{g}'_{k-1}(\bar{l}_{k-1}), \bar{l}_k]$. For any strategy 
g for which treatment at each k already does not depend on past treatment history, g and $g'$ are the identical set of 
functions. The above definition of $g'$ in terms of g guarantees that an individual has followed strategy g through time t 
in the observed data, i.e., $A_k = g_k(\bar{A}_{k-1}, \bar{L}_k)$ for $k \le t$, if and only if the individual has 
followed strategy $g'$ through t, i.e., $A_k = g'_k(\bar{L}_k)$ for $k \le t$. 


The causal diagram in Figure 19.1 lacks arrows from either the measured 
covariates $L_k$ or the unmeasured covariates $U_k$ into treatment $A_k$. The causal 
diagram in Figure 19.2 has arrows from the measured covariates $L_k$, but not 
from the unmeasured covariates $U_k$, into treatment $A_k$. The causal diagram 
in Figure 19.3 has arrows from both the measured covariates $L_k$ and the 
unmeasured covariates $U_k$ into treatment $A_k$. 


Figure 19.1 


Figure 19.2 


Figure 19.3 


Figure 19.1 could represent a randomized experiment in which treatment 
$A_k$ at each time k is randomly assigned with a probability that depends only on 
prior treatment history. Our HIV study would be represented by Figure 19.1 if, 
for example, an individual’s treatment value at each month k were randomly 
assigned with probability 0.5 for individuals who did not receive treatment 
during the previous month ($A_{k-1} = 0$), and with probability 1 for individuals 
who did receive treatment during the previous month ($A_{k-1} = 1$). When 
interested in the contrast of static treatment strategies, Figure 19.1 is the 
proper generalization of no confounding by measured or unmeasured variables 
for time-varying treatments. Under this causal diagram, the counterfactual 
outcome mean $E[Y^{\bar{a}}]$ if everybody had followed the static treatment 
strategy $\bar{a}$ is simply the mean outcome $E[Y \mid \bar{A} = \bar{a}]$ among 
those who followed the strategy $\bar{a}$. (Interestingly, the same is not true 
for dynamic strategies. The counterfactual mean $E[Y^g]$ under a dynamic strategy 
g that depends on the variables L is only the mean outcome among those who 
followed the strategy g if the probability of receiving treatment $A_k = 1$ is 
exactly 0.5 at all times k at which treatment $A_k$ depends on $L_k$. Otherwise, 
identifying $E[Y^g]$ requires the application of g-methods to data on L, A, and Y 
under either Figure 19.1 or Figure 19.2.) 

Fine Point 19.2 


Per-protocol effects to compare treatment strategies. Many randomized trials assign individuals to a treatment 
at baseline with the intention that they will keep taking it during the follow-up, unless the treatment becomes toxic or 
otherwise contraindicated. That is, the protocol of the trial implicitly or explicitly aims at the comparison of dynamic 
treatment strategies, and the per-protocol effect (introduced in Section 9.5) is the effect that would have been observed 
if everybody had adhered to their assigned treatment strategy. 

For example, the goal of a trial of statin therapy among healthy individuals may be the comparison of the dynamic 
strategies “initiate statin therapy at baseline and keep taking it during the study unless rhabdomyolysis occurs” versus 
“do not take statin therapy during the study unless LDL-cholesterol is high or coronary heart disease is diagnosed.” 
Estimating the per-protocol effect in this randomized trial raises the same issues as any comparison of treatment 
strategies in an observational study. Specifically, valid estimation of the per-protocol effect generally demands that trial 
investigators collect post-randomization data on adherence to the strategy and on (time-varying) prognostic factors 
associated with adherence (Hernán and Robins 2017). Baseline randomization makes us expect baseline exchangeability 
for the assigned treatment strategy, not sequential exchangeability for the strategy that is actually received. 


Figure 19.2 could represent a randomized experiment in which treatment $A_k$ 
at each time k is randomly assigned by the investigators with a probability that 
depends on prior treatment and measured covariate history. Our study would 
be represented by Figure 19.2 if, for example, an individual’s treatment value 
at each month k were randomly assigned with probability 0.4 for untreated 
individuals with high CD4 cell count ($A_{k-1} = 0$, $L_k = 0$), 0.8 for untreated 
individuals with low CD4 cell count ($A_{k-1} = 0$, $L_k = 1$), and 0.5 for previously 
treated individuals, regardless of their CD4 cell count ($A_{k-1} = 1$). In Figure 
19.2, there is confounding by measured, but not unmeasured, variables for the 
time-varying treatment. 
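
The assignment mechanism described for Figure 19.2 can be sketched as a small simulation. This is an illustration under stated assumptions: the function names are hypothetical, CD4 is coded as $L_k = 1$ for low and $L_k = 0$ for high (the coding defined earlier in the chapter), and the covariate path is supplied as fixed input even though in a real study later CD4 values could depend on earlier treatment.

```python
import random

def treatment_probability(prev_a, l_k):
    """Pr[A_k = 1 | A_{k-1}, L_k] for the example probabilities in the text."""
    if prev_a == 1:
        return 0.5                      # previously treated, regardless of CD4
    return 0.8 if l_k == 1 else 0.4     # untreated: low CD4 (1) vs high CD4 (0)

def assign_treatments(l_path, seed=0):
    """Sequentially randomize treatment along a given covariate path."""
    rng = random.Random(seed)
    prev_a = 0  # A_{-1} = 0: nobody is treated before time 0
    a_hist = []
    for l_k in l_path:
        a_k = 1 if rng.random() < treatment_probability(prev_a, l_k) else 0
        a_hist.append(a_k)
        prev_a = a_k
    return a_hist

print(assign_treatments([0, 0, 1, 1]))
```

Because the probabilities depend only on measured history, the investigators know them exactly; in an observational study with the same structure they would have to be estimated from the data.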


An experiment in which treatment is randomly assigned at each time k to 
each individual is referred to as a sequentially randomized experiment. 
Therefore Figures 19.1 and 19.2 could represent sequentially randomized 
experiments. On the other hand, Figure 19.3 cannot represent a randomized 
experiment: the value of treatment $A_k$ at each time k depends partly on 
unmeasured variables U which are causes of L and Y, but unmeasured variables 
obviously cannot be used by investigators to assign treatment. That is, a 
sequentially randomized experiment can be represented by a causal diagram with 
many time points $k = 0, 1, \ldots, K$ and with no direct arrows from the 
unmeasured prognostic factors U into treatment $A_k$ at any time k. 


In observational studies, decisions about treatment often depend on out- 
come predictors such as prognostic factors. Therefore, observational studies 
will be typically represented by either Figure 19.2 or Figure 19.3 rather than 
Figure 19.1. For example, suppose our HIV follow-up study were an observational 
study (not an experiment) in which the lower the CD4 cell count $L_k$, the 
more likely a patient is to be treated. Then our study would be represented by 
Figure 19.2 if, at each month k, treatment decisions in the real world were made 
based on the values of prior treatment and CD4 cell count history 
$(\bar{A}_{k-1}, \bar{L}_k)$, but not on the values of any unmeasured variables 
$U_k$. Thus, an observational 
study represented by Figure 19.2 would differ from a sequentially randomized 
experiment only in that the assignment probabilities are unknown (but could 
be estimated from the data). Unfortunately, it is impossible to show empiri- 
cally whether an observational study is represented by the causal diagram in 
either Figure 19.2 or Figure 19.3. Observational studies represented by Figure 
19.3 have unmeasured confounding, as we describe later. 

Sequentially randomized experiments are not frequently used in practice. 
However, the concept of sequentially randomized experiment is helpful to un- 
derstand some key conditions for valid estimation of causal effects of time- 
varying treatments. The next section presents these conditions formally. 
19.4 Sequential exchangeability 


For those with observed treatment history 
$[A_0 = g(L_0), A_1 = g(A_0, L_0, L_1)]$ equal to (i.e., compatible with) the 
treatment they would have received under strategy g through the end of 
follow-up, the counterfactual outcome $Y^g$ is equal to the observed outcome Y 
and therefore also to the counterfactual outcome under the strategy 
$a_0 = A_0$, $a_1 = A_1$. 

In Figure 19.1, sequential unconditional exchangeability for $Y^{\bar{a}}$ 
holds, that is, 
$Y^{\bar{a}} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}$ for all 
static strategies $\bar{a}$. Unconditional exchangeability implies that 
association is causation, i.e., 
$E[Y^{\bar{a}}] = E[Y \mid \bar{A} = \bar{a}]$. 

Whenever we talk about identification of causal effects, the identifying 
formula will be the g-formula. In rare cases not relevant to our discussion, 
effects can be identified by formulas that are related to, but not equal to, 
the g-formula (e.g., Technical Point 7.3). 


As described in Parts I and II, valid causal inferences about time-fixed 
treatments typically require conditional exchangeability 
$Y^a \perp\!\!\!\perp A \mid L$. When exchangeability 
$Y^a \perp\!\!\!\perp A \mid L$ holds, we can obtain unbiased estimates of the 
causal effect of treatment A on the outcome Y if we appropriately adjust for the 
variables in L via standardization, IP weighting, g-estimation, or other methods. 
We expect conditional exchangeability to hold in conditionally randomized 
experiments, that is, trials in which individuals are assigned treatment with a 
probability that depends on the values of the covariates L. Conditional 
exchangeability holds in observational studies if the probability of receiving 
treatment depends on the measured covariates L and, conditional on L, does not 
further depend on any unmeasured, common causes of treatment and outcome. 

Similarly, causal inference with time-varying treatments requires adjusting 
for the time-varying covariates $\bar{L}_k$ to achieve conditional exchangeability 
at each time point, that is, sequential conditional exchangeability. For example, 
in a study with two time points, sequential conditional exchangeability is the 
combination of conditional exchangeability at both the first time and the 
second time of the study. That is, $Y^g \perp\!\!\!\perp A_0 \mid L_0$ and 
$Y^g \perp\!\!\!\perp A_1 \mid A_0 = g(L_0), L_0, L_1$. 
(For brevity, in this book we drop the word “conditional” and simply say 
sequential exchangeability.) We will refer to this set of conditional independences 
as sequential exchangeability for $Y^g$ under any strategy g, static or dynamic, 
that involves interventions on both components of the time-varying treatment 
$(A_0, A_1)$. 

A sequentially randomized experiment, that is, an experiment in which treatment 
$A_k$ at each time k is randomly assigned with a probability that depends only 
on the values of prior covariate history $\bar{L}_k$ and treatment history 
$\bar{A}_{k-1}$, implies sequential exchangeability for $Y^g$. That is, for any 
strategy g, the treated and the untreated at each time k are exchangeable for 
$Y^g$ conditional on prior covariate history $\bar{L}_k$ and any observed 
treatment history $\bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1})$ compatible 
with strategy g. Formally, sequential exchangeability for $Y^g$ is defined as 


$Y^g \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1}), \bar{L}_k$ for all strategies g and $k = 0, 1, \ldots, K$ 


This form of sequential exchangeability (there are others, as we will see) 
always holds in any causal graph which, like Figure 19.2, has no arrows from 
the unmeasured variables U into the treatment variables A. Therefore 
sequential exchangeability for $Y^g$ holds in sequentially randomized experiments 
and observational studies in which the probability of receiving treatment at each 
time depends on treatment and measured covariate history 
$(\bar{A}_{k-1}, \bar{L}_k)$ and, conditional on this history, does not depend on 
any unmeasured causes of the outcome. 

That is, in observational studies represented by Figure 19.2 the mean of 
the counterfactual outcome $E[Y^g]$ under all strategies g is identified, whereas 
in observational studies represented by Figure 19.3 no mean counterfactual 
outcome $E[Y^g]$ is identified. In observational studies represented by other 
causal diagrams, the mean counterfactual outcome $E[Y^g]$ under some but not 
all strategies g is identified. 

For example, consider an observational study represented by the causal 
diagram in Figure 19.4, which includes an unmeasured variable $W_0$. In our 
HIV example, $W_0$ could be an indicator for a scheduled clinic visit at time 
0 that was not recorded in our database. In that case $W_0$ would be a cause 
shared by treatment $A_0$ and obtaining a (somewhat noisy) measurement $L_1$ of 
CD4 cell count, with U representing the underlying but unknown true value 
of CD4 cell count. Even though $W_0$ is unmeasured, the mean counterfactual 
outcome is still identified under any static strategy $g = \bar{a}$; however, the 
mean counterfactual outcome $E[Y^g]$ is not identified under any dynamic strategy 
g with treatment assignment depending on $L_1$. To illustrate why identification 
is possible under some but not all strategies, we will use SWIGs in the next 
section. 


Technical Point 19.2 


Positivity and consistency for time-varying treatments. The positivity condition needs to be generalized from the 
fixed version “if $f_L(l) \neq 0$, $f_{A|L}(a|l) > 0$ for all a and l” to the sequential version 


If $f_{\bar{A}_{k-1}, \bar{L}_k}(\bar{a}_{k-1}, \bar{l}_k) \neq 0$, then $f_{A_k | \bar{A}_{k-1}, \bar{L}_k}(a_k | \bar{a}_{k-1}, \bar{l}_k) > 0$ for all $(\bar{a}_k, \bar{l}_k)$ 


In a sequentially randomized experiment, positivity will hold if the randomization probabilities at each time k are never 
0 or 1, no matter the past treatment and covariate history. If we are interested in a particular strategy g, the 
above positivity condition needs to only hold for treatment histories compatible with g, i.e., for each k, 
$a_k = g(\bar{a}_{k-1}, \bar{l}_k)$. 

The consistency condition also needs to be generalized from the fixed version “If A = a for a given individual, then 
$Y^a = Y$ for that individual” to the sequential version 


$Y^{\bar{a}} = Y^{\bar{a}^*}$ if $\bar{a} = \bar{a}^*$; $Y^{\bar{a}} = Y$ if $\bar{A} = \bar{a}$; $\bar{L}_k^{\bar{a}} = \bar{L}_k^{\bar{a}^*}$ if $\bar{a}_{k-1} = \bar{a}^*_{k-1}$; $\bar{L}_k^{\bar{a}} = \bar{L}_k$ if $\bar{A}_{k-1} = \bar{a}_{k-1}$ 


where $\bar{L}_k^{\bar{a}}$ is the counterfactual L-history through time k under strategy $\bar{a}$. Technically, the 
identification of effects of time-varying treatments on Y requires weaker consistency conditions: “If $\bar{A} = \bar{a}$ 
for a given individual, then $Y^{\bar{a}} = Y$ for that individual” is sufficient for static strategies, and “For any 
strategy g, if $A_k = g_k(\bar{A}_{k-1}, \bar{L}_k)$ at each time k for a given individual, then $Y^g = Y$” is sufficient 
for dynamic strategies. However, the stronger sequential consistency is a natural condition that we will always accept. 

Note that, if we expect that the interventions “treat in month k” corresponding to $A_k = 1$ and “do not treat in 
month k” corresponding to $A_k = 0$ are sufficiently well defined at all times k, then all static and dynamic strategies 
involving $A_k$ will be similarly well defined. 
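
The sequential positivity condition of Technical Point 19.2 suggests a simple descriptive check on data: for every observed combination of past treatment and covariate history, both treatment values should occur. The sketch below is an assumption-based illustration with a hypothetical function name and made-up records. Finding no empty cells in a sample does not establish positivity in the population, and an empty cell may reflect chance rather than a structural violation.

```python
from collections import defaultdict

def positivity_violations(records):
    """Histories (a_{k-1} history, l_k history) observed with only one treatment value.

    records: iterable of (treatment_history, covariate_history, a_k) tuples,
    one per individual and time point.
    """
    seen = defaultdict(set)
    for hist_a, hist_l, a_k in records:
        seen[(hist_a, hist_l)].add(a_k)
    return sorted(h for h, values in seen.items() if values != {0, 1})

# Made-up person-time records at k = 1: (A_0,), (L_0, L_1), A_1
records = [
    ((0,), (0, 1), 1),
    ((0,), (0, 1), 0),
    ((1,), (0, 1), 1),   # previously treated with low CD4: only A_1 = 1 observed
    ((1,), (0, 1), 1),
]
print(positivity_violations(records))
```

When the interest lies in a particular strategy g, the check can be restricted to histories compatible with g, as the Technical Point notes.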


Figure 19.4 

In addition to some form of sequential exchangeability, causal inference 
involving time-varying treatments also requires a sequential version of the 
conditions of positivity and consistency. In a sequentially randomized experiment, 
both sequential positivity and consistency are expected to hold (see Technical 
Point 19.2). Below we will assume that sequential positivity and consistency 
hold. Under the three identifiability conditions, we can identify the mean 
counterfactual outcome $E[Y^g]$ under a strategy of interest g as long as we use 
methods that appropriately adjust for treatment and covariate history 
$(\bar{A}_{k-1}, \bar{L}_k)$, such as the g-formula (standardization), IP 
weighting, and g-estimation. 
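
For orientation, the identifying expressions can be written out for the simplest case. The block below is a sketch, stated without derivation, for two time points (k = 0, 1), a static strategy $(a_0, a_1)$, and discrete covariates, assuming sequential exchangeability, positivity, and consistency hold. The layout and the indicator notation $I(\cdot)$ are choices made for this illustration.

```latex
% g-formula (standardization) for a static strategy (a_0, a_1):
E\left[Y^{a_0,a_1}\right]
  = \sum_{l_0}\sum_{l_1}
    E\left[Y \mid A_0 = a_0, A_1 = a_1, L_0 = l_0, L_1 = l_1\right]
    f\left(l_1 \mid a_0, l_0\right) f\left(l_0\right)

% Equivalent IP weighted expression under the same conditions:
E\left[Y^{a_0,a_1}\right]
  = E\left[
      \frac{I\left(A_0 = a_0, A_1 = a_1\right)}
           {f\left(A_0 \mid L_0\right) f\left(A_1 \mid A_0, L_0, L_1\right)}
      \, Y
    \right]
```

Both expressions adjust for the history $(\bar{A}_{k-1}, \bar{L}_k)$ at each time: the first by standardizing over the covariate distribution given past treatment, the second by weighting each individual by the inverse of the probability of the treatment history actually received.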


19.5 Identifiability under some but not all treatment strategies 


In Chapter 7, we presented a graphical rule, the backdoor criterion, to assess 
whether exchangeability holds for a time-fixed treatment under a particular 
causal diagram. The backdoor criterion can be generalized for time-varying 
treatments. For example, for static strategies, a sufficient condition for 
identification of the causal effect of treatment strategies is that, at each 
time k, all backdoor paths into $A_k$ that do not go through any future treatment 
are blocked. 

Pearl and Robins (1995) proposed a generalized backdoor criterion for 
static strategies. Robins (1997) extended the procedure to dynamic strategies. 


Figure 19.5 


Figure 19.6 


Figure 19.7 


$Y^{a_0,a_1} \perp\!\!\!\perp A_1^{a_0} \mid A_0 = a_0, L_1^{a_0}$ equals 
$Y^{a_0,a_1} \perp\!\!\!\perp A_1 \mid A_0 = a_0, L_1$ because, by consistency, 
$L_1^{a_0} = L_1$ and $A_1^{a_0} = A_1$ when $A_0 = a_0$. 

However, the generalized backdoor criterion does not directly show the con- 
nection between blocking backdoor paths and sequential exchangeability, be- 
cause the procedure is based on causal directed acyclic graphs that do not 
include counterfactual outcomes. An alternative graphical check for identifia- 
bility of causal effects is based on SWIGs, also discussed in Chapter 7. SWIGs 
are especially helpful for time-varying treatments. 

Consider the causal diagrams in Figures 19.5 and 19.6, which are simplified
versions of those in Figures 19.2 and 19.4. We have omitted the nodes $U_0$ and
$L_0$ and the arrow from $A_0$ to $U_1$. In addition, the arrow from $L_1$ to $Y$ is absent
so $L_1$ is no longer a direct cause of $Y$. Figures 19.5 and 19.6 (like Figures 19.2
and 19.4) differ in whether $A_k$ and subsequent covariates $L_t$ for $t > k$ share a
cause $W_k$.

As discussed in Part I of this book, a SWIG represents a counterfactual
world under a particular intervention. The SWIG in Figure 19.7 represents the
world in Figure 19.5 if all individuals had received the static strategy $(a_0, a_1)$,
where $a_0$ and $a_1$ can take values 0 or 1. For example, Figure 19.7 can be used
to represent the world under the strategy “always treat” $(a_0 = 1, a_1 = 1)$ or
under the strategy “never treat” $(a_0 = 0, a_1 = 0)$. To construct this SWIG, we
first split the treatment nodes $A_0$ and $A_1$. The right side of the split
treatments represents the value of treatment under the intervention. The left side
represents the value of treatment that would have been observed when
intervening on all previous treatments. Therefore, the left side of $A_0$ is precisely
$A_0$ because there are no previous treatments to intervene on, and the left side
of $A_1$ is the counterfactual treatment $A_1^{a_0}$ that would be observed after setting
$A_0$ to the value $a_0$. All arrows into a given treatment in the original causal
diagram now point into the left side, and all arrows out of a given treatment
now originate from the right side. The outcome variable is the counterfactual
outcome $Y^{a_0,a_1}$ and the covariates $L$ are replaced by their corresponding
counterfactual variables. Note that we write the counterfactual variable
corresponding to $L_1$ under strategy $(a_0, a_1)$ as $L_1^{a_0}$, rather than $L_1^{a_0,a_1}$, because a
future intervention on $A_1$ cannot affect the value of earlier $L_1$.

Unlike the directed acyclic graph in Figure 19.5, the SWIG in Figure 19.7 
does include the counterfactual outcome, which means that we can visually 
check for exchangeability using d-separation. 

In Figure 19.7, we can use d-separation to show that both
$Y^{a_0,a_1} \perp\!\!\!\perp A_0$ and $Y^{a_0,a_1} \perp\!\!\!\perp A_1^{a_0} \mid A_0, L_1^{a_0}$
hold for any static strategy $(a_0, a_1)$. Note that this second
conditional independence holds even though there seems to be an open path
$A_1^{a_0} \leftarrow a_0 \rightarrow L_1^{a_0} \leftarrow U_1 \rightarrow Y^{a_0,a_1}$.
However, this path is actually blocked for the following reason. In the
counterfactual world, $a_0$ is a constant, and in probability statements constants
are always implicitly conditioned on even though, by convention, they are not
shown in the conditioning event. When checking d-separation we therefore need
to remember that constants are conditioned on, which blocks the above path.
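
The d-separation argument above can be checked mechanically. The following Python sketch implements d-separation through the standard ancestral moral graph construction and applies it to an edge list that we reconstructed from the description of Figure 19.7 in the text. The edge list, the node labels (`A0` for the left half of the split node, `a0` for the right half, `L1` for $L_1^{a_0}$, `A1` for $A_1^{a_0}$) and the function name `d_separated` are assumptions made for illustration, not a transcription of the figure.

```python
def d_separated(edges, xs, ys, zs):
    """Return True if every node in xs is d-separated from every node in ys given zs."""
    parents = {}
    for tail, head in edges:
        parents.setdefault(head, set()).add(tail)

    # Step 1: keep only xs, ys, zs and their ancestors.
    ancestral = set()
    stack = list(set(xs) | set(ys) | set(zs))
    while stack:
        node = stack.pop()
        if node not in ancestral:
            ancestral.add(node)
            stack.extend(parents.get(node, ()))

    # Step 2: moralize (link each child to its parents and the parents to each other).
    neighbors = {node: set() for node in ancestral}
    for child in ancestral:
        group = list(parents.get(child, ())) + [child]
        for u in group:
            for v in group:
                if u != v:
                    neighbors[u].add(v)

    # Step 3: delete the conditioning set and test whether xs can still reach ys.
    blocked = set(zs)
    seen = set()
    stack = [x for x in xs if x not in blocked]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in ys:
            return False
        stack.extend(neighbors[node] - blocked - seen)
    return True


# Assumed edges of the SWIG in Figure 19.7 (hypothetical reconstruction).
# The left half A0 has no arrows in or out, so it appears in no edge.
SWIG_EDGES = [
    ("a0", "L1"), ("a0", "A1"), ("L1", "A1"),
    ("U1", "L1"), ("U1", "Y"), ("a1", "Y"),
]

first = d_separated(SWIG_EDGES, {"A0"}, {"Y"}, set())
second = d_separated(SWIG_EDGES, {"A1"}, {"Y"}, {"A0", "L1", "a0"})
second_without_constant = d_separated(SWIG_EDGES, {"A1"}, {"Y"}, {"A0", "L1"})
```

Under these assumed edges, `first` and `second` are true, while `second_without_constant` is false: the path through the constant `a0` is blocked only once `a0` is included in the conditioning set, which is the point made in the paragraph above.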

The second conditional independence
$Y^{a_0,a_1} \perp\!\!\!\perp A_1^{a_0} \mid A_0, L_1^{a_0}$ implies, by definition,
$Y^{a_0,a_1} \perp\!\!\!\perp A_1^{a_0} \mid A_0 = a_0, L_1^{a_0}$ in the subset of
individuals who received treatment $A_0 = a_0$. Therefore, by consistency, we
conclude that $Y^{a_0,a_1} \perp\!\!\!\perp A_0$ and
$Y^{a_0,a_1} \perp\!\!\!\perp A_1 \mid A_0 = a_0, L_1$ hold under the causal diagram in
Figure 19.5, which corresponds to the SWIG in Figure 19.7 where we can actually
check for exchangeability. If there were multiple time points, we would say that

$Y^{\bar{a}} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$ for $k = 0, 1, \ldots, K$

We refer to the above condition as static sequential exchangeability for $Y^{\bar{a}}$,
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Technical Point 19.3 


The many forms of sequential exchangeability. Consider a sequentially randomized experiment of a time-varying
treatment $A_k$ with multiple time points $k = 0, 1, \ldots, K$. The SWIG that represents this experiment is just a longer version
of Figure 19.7. The following conditional independence can be directly read from the SWIG:

$(Y^{\bar{a}}, \underline{L}_{k+1}^{\bar{a}}) \perp\!\!\!\perp A_k^{\bar{a}_{k-1}} \mid \bar{A}_{k-1}^{\bar{a}_{k-2}}, \bar{L}_k^{\bar{a}_{k-1}}$

where $\underline{L}_{k+1}^{\bar{a}}$ is the counterfactual covariate history from time $k + 1$ through the end of follow-up. The above conditional
independence implies $(Y^{\bar{a}}, \underline{L}_{k+1}^{\bar{a}}) \perp\!\!\!\perp A_k^{\bar{a}_{k-1}} \mid \bar{A}_{k-1}^{\bar{a}_{k-2}} = \bar{a}_{k-1}, \bar{L}_k^{\bar{a}_{k-1}}$ for the particular instance $\bar{A}_{k-1}^{\bar{a}_{k-2}} = \bar{a}_{k-1}$, with $\bar{a}_{k-1}$
being a component of strategy $\bar{a}$. Because of consistency, the last conditional independence statement equals

$(Y^{\bar{a}}, \underline{L}_{k+1}^{\bar{a}}) \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$

When this statement holds for all $\bar{a}$, we say that there is sequential exchangeability. Interestingly, even though this
sequential exchangeability condition only refers to static strategies $g = \bar{a}$, it is equivalent to the seemingly stronger

$(Y^g, \underline{L}_{k+1}^g) \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = g(\bar{A}_{k-2}, \bar{L}_{k-1}), \bar{L}_k$ for all $g$,

and, if positivity holds, is therefore sufficient to identify the outcome and covariate distribution under any static and
dynamic strategies $g$ (Robins 1986). This identification results from the joint conditional independence between $(Y^{\bar{a}}, \underline{L}_{k+1}^{\bar{a}})$
and $A_k$. Note that, for dynamic strategies, sequential exchangeability does not follow from the separate independences
$Y^{\bar{a}} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$ and $\underline{L}_{k+1}^{\bar{a}} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}, \bar{L}_k$.

Stronger conditional independences are expected to hold in a sequentially randomized experiment, but they
(i) cannot be read from SWIGs and (ii) are not necessary for identification of the causal effects of treatment
strategies in the population. For example, a sequentially randomized trial implies the stronger joint independence
$\{Y^{\bar{a}}, \underline{L}_{k+1}^{\bar{a}}; \text{all } \bar{a}\} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1}, \bar{L}_k$.

An even stronger condition that is expected to hold in sequentially randomized experiments is

$(Y^{\bar{\mathcal{A}}}, L^{\bar{\mathcal{A}}}) \perp\!\!\!\perp A_k \mid \bar{A}_{k-1}, \bar{L}_k$

where, for a dichotomous treatment $A_k$, $\bar{\mathcal{A}}$ denotes the set of all $2^K$ static strategies $\bar{a}$, $Y^{\bar{\mathcal{A}}}$ denotes the set of all
counterfactual outcomes $Y^{\bar{a}}$, and $L^{\bar{\mathcal{A}}}$ denotes the set of all counterfactual covariate histories. Using a terminology
analogous to that of Technical Point 2.1, we refer to this joint independence condition as full sequential exchangeability.


which is weaker than sequential exchangeability for $Y^g$, because it only
requires conditional independence between counterfactual outcomes $Y^{\bar{a}}$ indexed
by static strategies $g = \bar{a}$ and treatment $A_k$. Static sequential exchangeability
is sufficient to identify the mean counterfactual outcome under any static
strategy $g = \bar{a}$. See also Technical Point 19.3.

Static sequential exchangeability also holds under the causal diagram in
Figure 19.6, as can be checked by applying d-separation to its corresponding
SWIG in Figure 19.8. Therefore, in an observational study represented by
Figure 19.6, we can identify the mean counterfactual outcome under any static
strategy $(a_0, a_1)$.

Figure 19.8

Let us return to Figure 19.5. Let us now assume that the arrow from $L_1$
to $A_1$ were missing. In that case, the arrow from $L_1^{a_0}$ to $A_1^{a_0}$ would also be
missing from the SWIG in Figure 19.7. It would then follow by d-separation
that sequential exchangeability holds unconditionally for $A_0$ and conditionally
on $A_0$ for $A_1^{a_0}$, and therefore that the mean counterfactual outcome under any
static strategy could be identified without data on $L_1$. Now let us assume that,
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Fine Point 19.3 


Dynamic strategies that depend on baseline covariates. For simplicity, the causal graphs depicted in this chapter
do not include a baseline confounder $L_0$. If we included $L_0$ in Figure 19.9, then we could have considered a strategy
in which the random variable representing the intervention $g_0(L_0)$ replaces $g_0$. Then, when checking d-separation
between $A_1^g$ and $Y^g$ on the graph, $Y^g \perp\!\!\!\perp A_1^g \mid A_0, g_0(L_0), L_0, L_1^g$, we need to condition on the entire past, including
$g_0(L_0)$. If we instantiate this expression at $A_0 = g_0(L_0)$, then the intervention variable can be removed from the
conditioning event because $g_0(L_0)$ is now equal to the observed $A_0$ and thus is redundant. That is, we have now
$Y^g \perp\!\!\!\perp A_1^g \mid A_0 = g_0(L_0), L_0, L_1^g$ which, by consistency, is $Y^g \perp\!\!\!\perp A_1 \mid A_0 = g_0(L_0), L_0, L_1$. This conditional independence is
sequential exchangeability for $Y^g$ and treatment $A_1$ when there is also a baseline confounder $L_0$.





in Figure 19.5, there was an arrow from $U_1$ to $A_1$. Then the SWIG in Figure
19.7 would include an arrow from $U_1$ to $A_1^{a_0}$, and no form of sequential
exchangeability would hold. Therefore the counterfactual mean would not be
identified under any strategy.

We now discuss the SWIGs for Figures 19.5 and 19.6 under dynamic regimes.
The SWIG in Figure 19.9 represents the world of Figure 19.5 under a dynamic
treatment strategy $g = [g_0, g_1(L_1)]$ in which treatment $A_0$ is assigned a fixed
value $g_0$ (either 0 or 1), and treatment $A_1$ at time $k = 1$ is assigned a value
$g_1(L_1^g)$ that depends on the value of $L_1^g$ that was observed after having
assigned treatment value $g_0$ at time $k = 0$. For example, $g$ may be the strategy
“do not treat at time 0, treat at time 1 only if CD4 cell count is low, i.e., if
$L_1^g = 1$”. Under this strategy $g_0 = 0$ for everybody, and $g_1(L_1^g) = 1$ when
$L_1^g = 1$ and $g_1(L_1^g) = 0$ when $L_1^g = 0$. Therefore the SWIG includes an arrow
from $L_1^g$ to $g_1(L_1^g)$. This arrow was not part of the original causal graph; it is
the result of the intervention. We therefore draw this arrow differently from
the others, even though we need to treat it as any other arrow when evaluating
d-separation. The outcome in the SWIG is the counterfactual outcome $Y^g$
under the dynamic treatment strategy $g$.

Figure 19.9

By applying d-separation to the SWIG in Figure 19.9, we find that both
$Y^g \perp\!\!\!\perp A_0$ and $Y^g \perp\!\!\!\perp A_1^g \mid A_0 = g_0, L_1^g$ hold for any strategy $g$. That is, sequential
exchangeability for $Y^g$ holds, which means that we can identify the mean
counterfactual outcome under all strategies $g$ (see also Fine Point 19.3). This
result, however, does not hold for the causal diagram in Figure 19.6.

Figure 19.10





The SWIG in Figure 19.10 represents the world of Figure 19.6 under a
dynamic treatment strategy $g = [g_0, g_1(L_1)]$. By applying d-separation to the
SWIG in Figure 19.10, we find that $Y^g \perp\!\!\!\perp A_0$ does not hold because of the open
path $A_0 \leftarrow W_0 \rightarrow L_1^g \rightarrow g_1(L_1^g) \rightarrow Y^g$. That is, sequential exchangeability for
$Y^g$ does not hold, which means that we cannot identify the mean counterfactual
outcome for any strategy $g$.

Technically, what we read from the SWIG is $Y^g \perp\!\!\!\perp A_1^g \mid A_0, L_1^g$ which, by
consistency, implies $Y^g \perp\!\!\!\perp A_1 \mid A_0 = g_0, L_1$.


In summary, in observational studies (or sequentially randomized trials)
represented by Figure 19.5, sequential exchangeability for $Y^g$ holds, and
therefore the data can be used to validly estimate causal effects involving static and
dynamic strategies. On the other hand, in observational studies represented by
Figure 19.6, only the weaker condition for static strategies holds, and therefore
the data can be used to validly estimate causal effects involving static
strategies, but not dynamic strategies. Another way to think about this is that in
the counterfactual world represented by the SWIG in Figure 19.10, the
distribution of $Y^g$ depends on the distribution of $g_1(L_1^g)$ and thus of $L_1^g$. However,
the distribution of $L_1^g$ is not identifiable due to the path $A_0 \leftarrow W_0 \rightarrow L_1$.
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One last example. Consider Figure 19.11 which is equal to Figure 19.6 
except for the presence of an arrow from $L_1$ to $Y$, and its corresponding SWIG
under a static strategy in Figure 19.12. We can use d-separation to show that
neither sequential exchangeability for $Y^g$ nor static sequential exchangeability
for $Y^{\bar{a}}$ holds. Therefore, in an observational study represented by Figure 19.11, we
cannot use the data to validly estimate causal effects involving any strategies.
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Figure 19.11 
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a a 
Aola —> LP Ay?| a, — YO 
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Figure 19.12 


A second backdoor path gets open after conditioning on collider $L_1$:
$A_1 \leftarrow A_0 \rightarrow L_1 \leftarrow U \rightarrow Y$.
This second backdoor path can be safely blocked by conditioning on
prior treatment $A_0$, assuming it is available to investigators.





No form of sequential exchangeability is guaranteed to hold in observational
studies. Achieving approximate exchangeability requires expert knowledge,
which will guide investigators in the design of their studies to measure as
many of the relevant variables $L_k$ as possible. For example, in an HIV study,
experts would agree that time-varying variables like CD4 cell count, viral load,
and symptoms need to be appropriately measured and adjusted for.

But the question “Are the measured covariates sufficient to ensure sequen- 
tial exchangeability?” can never be answered with certainty. Yet we can use 
our expert knowledge to organize our beliefs about exchangeability and rep- 
resent them in a causal diagram. Figures 19.1 to 19.4 are examples of causal 
diagrams that summarize different scenarios. Note that we drew these causal 
diagrams in the absence of selection (e.g., censoring by loss to follow-up) so 
that we can concentrate on confounding here. 

Consider Figure 19.5. Like in Part I of this book, suppose that we are
interested in the effect of the time-fixed treatment $A_1$ on the outcome $Y$. We
say that there is confounding for the effect of $A_1$ on $Y$ because $A_1$ and $Y$
share the cause $U$, i.e., because there is an open backdoor path between $A_1$
and $Y$ through $U$. To estimate this effect without bias, we need to adjust for
confounders of the effect of the treatment $A_1$ on the outcome $Y$, as explained
in Chapter 7. In other words, we need to be able to block all open backdoor
paths between $A_1$ and $Y$. This backdoor path $A_1 \leftarrow L_1 \leftarrow U \rightarrow Y$ cannot
be blocked by conditioning on the common cause $U$ because $U$ is unmeasured
and therefore unavailable to the investigators. However, this backdoor path
can be blocked by conditioning on $L_1$, which is measured. Thus, if the
investigators collected data on $L_1$ for all individuals, there would be no unmeasured
confounding for the effect of $A_1$. We then say that $L_1$ is a confounder for
the effect of $A_1$, even though the actual common cause of $A_1$ and $Y$ was the
unmeasured $U$ (re-read Section 7.3 if you need to refresh your memory about
confounding and causal diagrams).

As discussed in Chapter 7, the confounders do not have to be direct causes
of the outcome. In Figure 19.5, the arrow from the confounder $L_1$ to the
outcome $Y$ does not exist. Then the source of the confounding (i.e., the causal
confounder) is the unmeasured common cause $U$. Nonetheless, because data
on $L_1$ suffice to block the backdoor paths from $A_1$ to $Y$ and thus to control
confounding, we refer to $L_1$ as a confounder for the effect of $A_1$ on $Y$.

Now imagine the very long causal diagram that contains all time points
$k = 0, 1, 2, \ldots$, and in which $L_k$ affects subsequent treatments $A_k, A_{k+1}, \ldots$ and
shares unmeasured causes $U_k$ with the outcome $Y$. Suppose that we want to
estimate the causal effects on the outcome $Y$ of treatment strategies defined
by interventions on $A_0, A_1, A_2, \ldots$ Then, at each time $k$, the covariate history $\bar{L}_k$
will be needed, together with the treatment history $\bar{A}_{k-1}$, to block the
backdoor paths between treatment $A_k$ and the outcome $Y$. Thus, no unmeasured
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Fine Point 19.4 


A definition of time-varying confounding. In the absence of selection bias, we say there is confounding for causal
effects involving $E[Y^{\bar{a}}]$ if $E[Y^{\bar{a}}] \neq E[Y \mid \bar{A} = \bar{a}]$, that is, if the mean outcome had, contrary to fact, all individuals in
the study followed strategy $\bar{a}$ differs from the mean outcome among the subset of individuals who followed strategy $\bar{a}$
in the actual study.

We say the confounding is solely time-fixed (i.e., wholly attributable to baseline covariates) if $E[Y^{\bar{a}} \mid L_0] =
E[Y \mid \bar{A} = \bar{a}, L_0]$, as would be the case if the only arrows pointing into $A_1$ in Figure 19.2 were from $A_0$ and $L_0$. In
contrast, if the identifiability conditions hold, but $E[Y^{\bar{a}} \mid L_0] \neq E[Y \mid \bar{A} = \bar{a}, L_0]$, we say that time-varying confounding
is present. If the identifiability conditions do not hold, as in Figure 19.3, we say that there is unmeasured confounding.

A sufficient condition for no time-varying confounding is unconditional sequential exchangeability for $Y^{\bar{a}}$, that is,
$Y^{\bar{a}} \perp\!\!\!\perp A_k \mid \bar{A}_{k-1} = \bar{a}_{k-1}$. This condition holds in sequentially randomized experiments, like the one represented in Figure
19.1, in which treatment $A_k$ at each time $k$ is randomly assigned with a probability that depends only on the values of
prior treatment history $\bar{A}_{k-1}$. In fact, the causal diagram in Figure 19.1 can be greatly simplified. To do so, first note
that $L_1$ is not a common cause of any two nodes in the graph so it can be omitted from the graph. Once $L_1$ is gone,
then both $L_0$ and $U_1$ can be omitted too because they cease to be common causes of two nodes in the graph. In the
graph without $L_0$, $L_1$, and $U_1$, the node $U_0$ can be omitted too. That is, the causal diagram in Figure 19.1 can be
simplified to include only the nodes $A_0$, $A_1$ and $Y$.





confounding for the effect of $\bar{A}$ requires that the investigators collected data
on $\bar{L}_k$ for all individuals. We then say that the time-varying covariates in $\bar{L}_k$
are time-varying confounders for the effect of the time-varying treatment $\bar{A}$ on
$Y$ at several (or, in our example, all) times $k$ in the study. See Fine Point 19.4
for a more precise definition of time-varying confounding.

Time-varying confounders are sometimes referred to as time-dependent
confounders.

Unfortunately, we cannot empirically confirm that all confounders, whether 
time-fixed or time-varying, are measured. That is, we cannot empirically dif- 
ferentiate between Figure 19.2 with no unmeasured confounding and Figure 
19.3 with unmeasured confounding. Interestingly, even if all confounders were 
correctly measured and modeled, most adjustment methods may still result in 
biased estimates when comparing treatment strategies. The next chapter ex- 
plains why g-methods are the appropriate approach to adjust for time-varying 
confounders. 


Chapter 20 


TREATMENT-CONFOUNDER FEEDBACK 


The previous chapter identified sequential exchangeability as a key condition to identify the causal effects of time- 
varying treatments. Suppose that we have a study in which the strongest form of sequential exchangeability holds: 
the measured time-varying confounders are sufficient to validly estimate the causal effect of any treatment strategy. 
Then the question is what confounding adjustment method to use. The answer to this question highlights a key 
problem in causal inference about time-varying treatments: treatment-confounder feedback. 

When treatment-confounder feedback exists, using traditional adjustment methods may introduce bias in 
the effect estimates. That is, even if we had all the information required to validly estimate the average causal 
effect of any treatment strategy, we would be generally unable to do so. This chapter describes the structure of 
treatment-confounder feedback and the reasons why traditional adjustment methods fail. 


20.1 The elements of treatment-confounder feedback 


Figure 20.1


Figure 20.2


Consider again the sequentially randomized trial of HIV-positive individuals
that we discussed in the previous chapter. For every person in the study, we
have data on treatment $A_k$ (1: treated, 0: untreated) and covariates $L_k$ at each
month of follow-up $k = 0, 1, 2, \ldots, K$, and on an outcome $Y$ that measures health
status at month $K + 1$. The causal diagram in Figure 20.1, which is equal
to the one in Figure 19.2, represents the first two months of the study. The
time-varying covariates $L_k$ are time-varying confounders. (As in the previous
chapter, we are using this example without censoring so that we can focus on
confounding.)

Something else is going on in Figure 20.1. Not only is there an arrow from
CD4 cell count $L_k$ to treatment $A_k$, but also there is an arrow from treatment
$A_{k-1}$ to future CD4 cell count $L_k$, because receiving treatment $A_{k-1}$ increases
future CD4 cell count $L_k$. That is, the confounder affects the treatment and
the treatment affects the confounder. There is treatment-confounder feedback
(see also Fine Point 20.1).

Note that time-varying confounding can occur without treatment-confounder
feedback. The causal diagram in Figure 20.2 is the same as the one in
Figure 20.1, except that the arrows from treatment $A_{k-1}$ to future $L_k$ and
$U_k$ have been deleted. In a setting represented by this diagram, the
time-varying covariates $L_k$ are time-varying confounders, but they are not affected
by prior treatment. Therefore, there is time-varying confounding, but there is
no treatment-confounder feedback.

Treatment-confounder feedback creates an interesting problem for causal 
inference. To state the problem in its simplest form, let us simplify the causal 
diagram in Figure 20.1 a bit more. Figure 20.3 is the smallest subset of Figure 
20.1 that illustrates treatment-confounder feedback in a sequentially random- 
ized trial with two time points. When drawing the causal diagram in Figure 
20.3, we made two simplifications: 


• Because our interest is in the implications of confounding by $L_1$, we
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Fine Point 20.1 


Representing feedback cycles with acyclic graphs. Interestingly, an acyclic graph, like the one in Figure 20.1, can
be used to represent a treatment-confounder feedback loop or cycle. The trick to achieve this visual representation is
to elaborate the treatment-confounder feedback loop in time. That is, $A_{k-1} \rightarrow L_k \rightarrow A_k \rightarrow L_{k+1}$ and so on.

The representation of feedback cycles with acyclic graphs also requires that time be considered as a discrete variable.
That is, we say that treatment and covariates can change during each interval $[k, k + 1)$ for $k = 0, 1, \ldots, K$, but we do
not specify when exactly during the interval the change takes place. This discretization of time is not a limitation in
practice: the length of the intervals can be chosen to be as short as the granularity of the data requires. For example,
in a study where individuals see their doctors once per month or less frequently (as in our HIV example), time may be
safely discretized into month intervals. In other cases, year intervals or day intervals may be more appropriate. Also, as
we said in Chapter 17, time is typically measured in discrete intervals (years, months, days) anyway, so the discretization
of time is often not even a choice.


did not bother to include a node $L_0$ for baseline CD4 cell count. Just
suppose that treatment $A_0$ is marginally randomized and treatment $A_1$
is conditionally randomized given $L_1$.


• The unmeasured variable $U_0$ is not included.


• There is no arrow from $A_0$ to $A_1$, which implies that treatment is assigned
using information on $L_1$ only.


• There are no arrows from $A_0$, $L_1$ and $A_1$ to $Y$, which would be the case
if treatment has no causal effect on the outcome $Y$ of any individual, i.e.,
the sharp null hypothesis holds.


None of these simplifications affect the arguments below. A more
complicated causal diagram would not add any conceptual insights to the discussion
in this chapter; it would just be harder to read.

Figure 20.3

Now suppose that treatment has no effect on any individual’s $Y$, which
implies the causal diagram in Figure 20.3 is the correct one, but the investigators
do not know it. Also suppose that we have data on treatment $A_0$ in month 0
and $A_1$ in month 1, on the confounder CD4 cell count $L_1$ at the start of month 1,
and on the outcome $Y$ at the end of follow-up. We wish to use these data to
estimate the average causal effect of the static treatment strategy “always treat”,
$(a_0 = 1, a_1 = 1)$, compared with the static treatment strategy “never treat”,
$(a_0 = 0, a_1 = 0)$ on the outcome $Y$, that is, $E[Y^{a_0=1,a_1=1}] - E[Y^{a_0=0,a_1=0}]$.
According to Figure 20.3, the true, but unknown to the investigator, average
causal effect is 0 because there are no forward-directed paths from either
treatment variable to the outcome. That is, one cannot start at either $A_0$ or $A_1$
and, following the direction of the arrows, arrive at $Y$.

Figure 20.3 can depict a sequentially randomized trial because there are no
direct arrows from the unmeasured $U$ into the treatment variables. Therefore,
as we discussed in the previous chapter, we should be able to use the observed
data on $A_0$, $L_1$, $A_1$, and $Y$ to conclude that $E[Y^{a_0=1,a_1=1}] - E[Y^{a_0=0,a_1=0}]$
is equal to 0. However, as we explain in the next section, we will not generally
be able to correctly estimate the causal effect when we adjust for $L_1$ using
traditional methods, like stratification, outcome regression, and matching. That
is, in this example, an attempt to adjust for the confounder $L_1$ using these
methods will generally result in an effect estimate that is different from 0, and
thus invalid.

Figure 20.4
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Figure 20.3 represents either a se- 
quentially randomized trial or an 
observational study with no unmea- 
sured confounding; Figure 20.4 rep- 
resents an observational study. 


In other words, when there are time-varying confounders and treatment- 
confounder feedback, traditional methods cannot be used to correctly adjust 
for those confounders. Even if we had sufficient longitudinal data to ensure 
sequential exchangeability, traditional methods would not generally provide a 
valid estimate of the causal effect of any treatment strategies. In contrast, 
g-methods appropriately adjust for the time-varying confounders even in the 
presence of treatment-confounder feedback. 

This limitation of traditional methods applies to settings in which the time- 
varying confounders are affected by prior treatment as in Figure 20.3, but also 
to settings in which the time-varying confounders share causes W with prior 
treatment as in Figure 20.4, which is a subset of Figure 19.4. We refer to both 
Figures 20.3 and 20.4 (and Figures 19.2 and 19.4) as examples of treatment- 
confounder feedback. The next section explains why traditional methods can- 
not adequately handle treatment-confounder feedback. 


20.2 The bias of traditional methods 


This is an ideal trial with full adherence to the assigned treatment and
no losses to follow-up.





Table 20.1
   N    A_0   L_1   A_1   Mean Y
 2400    0     0     0      84
 1600    0     0     1      84
 2400    0     1     0      52
 9600    0     1     1      52
 4800    1     0     0      76
 3200    1     0     1      76
 1600    1     1     0      44
 6400    1     1     1      44


If there were additional times $k$ at which treatment $A_k$ were affected
by $L_k$, then $L_k$ would be a time-varying confounder.


Figure 20.3 represents the null because there is no arrow from $L_1$ to
$Y$. Otherwise, $A_0$ would have an effect on $Y$ through $L_1$.


To illustrate the bias of traditional methods, let us consider a (hypothetical)
sequentially randomized trial with 32,000 HIV-positive individuals and two
time points $k = 0$ and $k = 1$. Treatment $A_0 = 1$ is randomly assigned at
baseline with probability 0.5. Treatment $A_1$ is randomly assigned in month
1 with a probability that depends only on the value of CD4 cell count $L_1$ at
the start of month 1: 0.4 if $L_1 = 0$ (high), 0.8 if $L_1 = 1$ (low). The outcome
$Y$, which is measured at the end of follow-up, is a function of CD4 cell count,
concentration of virus in the serum, and other clinical measures, with higher
values of $Y$ signifying better health.

Table 20.1 shows the data from this trial. To save space, the table displays
one row per combination of values of $A_0$, $L_1$, and $A_1$, rather than one row per
individual. For each of the eight combinations, the table provides the number
of subjects $N$ and the mean value of the outcome $E[Y \mid A_0, L_1, A_1]$. Thus, row 1
shows that the mean outcome of the 2400 individuals with $(A_0 = 0, L_1 = 0, A_1 = 0)$ was
$E[Y \mid A_0 = 0, L_1 = 0, A_1 = 0] = 84$. In this sequentially randomized trial, the
identifiability conditions (sequential exchangeability, positivity, consistency)
hold. By design, there are no confounders for the effect of $A_0$ on $Y$, and $L_1$ is
the only confounder for the effect of $A_1$ on $Y$ so (conditional on $L_1$) sequential
exchangeability holds. By inspection of Table 20.1, we can conclude that the
positivity condition is satisfied, because otherwise one or more of the eight
rows would have zero individuals.
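
As a quick check of the description above, the following Python sketch encodes the eight rows of Table 20.1 and recomputes the sample size, the positivity check, and the treatment assignment probabilities of the design. The names `TABLE_20_1`, `count` and `prob_a1_treated` are illustrative choices that do not appear in the text.

```python
# Rows of Table 20.1 as (N, A0, L1, A1, mean Y). Names are illustrative only.
TABLE_20_1 = [
    (2400, 0, 0, 0, 84),
    (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52),
    (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76),
    (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44),
    (6400, 1, 1, 1, 44),
]


def count(**fixed):
    """Number of individuals whose a0, l1, a1 values match the given ones."""
    total = 0
    for n, a0, l1, a1, _ in TABLE_20_1:
        row = {"a0": a0, "l1": l1, "a1": a1}
        if all(row[name] == value for name, value in fixed.items()):
            total += n
    return total


def prob_a1_treated(a0, l1):
    """Observed Pr[A1 = 1 | A0 = a0, L1 = l1]."""
    return count(a0=a0, l1=l1, a1=1) / count(a0=a0, l1=l1)


total_n = count()                                   # 32,000 individuals
positivity_holds = all(n > 0 for n, *_ in TABLE_20_1)
prob_a0_treated = count(a0=1) / total_n             # marginal randomization at time 0
a1_probs = {(a0, l1): prob_a1_treated(a0, l1) for a0 in (0, 1) for l1 in (0, 1)}
```

In this sketch `a1_probs` depends on `l1` but not on `a0`, which matches a design in which treatment at time 1 is assigned using information on $L_1$ only.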

The causal diagram in Figure 20.3 depicts this sequentially randomized
experiment when the sharp null hypothesis holds. To check whether the data in
Table 20.1 are consistent with the causal diagram in Figure 20.3, we can
separately estimate the average causal effects of each of the time-fixed treatments
$A_0$ and $A_1$ within levels of past covariates and treatment, which should all be
null. In the calculations below, we will ignore random variability.

A quick inspection of the table shows that the average causal effect of
treatment $A_1$ is indeed zero in all four strata defined by $A_0$ and $L_1$. Consider
the effect of $A_1$ in the 4000 individuals with $A_0 = 0$ and $L_1 = 0$, whose data
are shown in rows 1 and 2 of Table 20.1. The mean outcome among those
who did not receive treatment at time 1, $E[Y \mid A_0 = 0, L_1 = 0, A_1 = 0]$, is 84,
and the mean outcome among those who did receive treatment at time 1,
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Technical Point 20.1 


G-null test. Suppose the sharp null hypothesis is true. Then any counterfactual outcome $Y^g$ is the observed outcome
$Y$. In this setting, sequential exchangeability for $Y^g$ can be written as $Y \perp\!\!\!\perp A_0 \mid L_0$ and $Y \perp\!\!\!\perp A_1 \mid A_0 = g(L_0), L_0, L_1$ in a
study with two time points. The first independence implies no causal effect of $A_0$ in any strata defined by $L_0$, and the
second independence implies no causal effect of $A_1$ in any strata defined by $L_1$ and $A_0$. Therefore, under sequential
exchangeability, a test of these conditional independences is a test of the sharp null. This is the g-null test.

Conversely, the g-null theorem (Robins 1986) says that, if these conditional independences hold, then the distribution
of $Y^g$ and therefore the mean $E[Y^g]$ is the same for all $g$, and also equal to the distribution and mean of the observed
$Y$. Note, however, that equality of distributions under all $g$ only implies the sharp null hypothesis under a strong
form of faithfulness that forbids perfect cancellations of effects. As discussed in Fine Point 6.2, we assume faithfulness
throughout this book unless we say otherwise.


$E[Y \mid A_0 = 0, L_1 = 0, A_1 = 1]$, is also 84. Therefore the difference

$E[Y \mid A_0 = 0, L_1 = 0, A_1 = 1] - E[Y \mid A_0 = 0, L_1 = 0, A_1 = 0]$

is zero. Because the identifiability conditions hold, this associational difference
validly estimates the average causal effect

$E[Y^{a_1=1} \mid A_0 = 0, L_1 = 0] - E[Y^{a_1=0} \mid A_0 = 0, L_1 = 0]$

in the stratum $(A_0 = 0, L_1 = 0)$. Similarly, it is easy to check that the
average causal effect of treatment $A_1$ on $Y$ is zero in the remaining three strata
$(A_0 = 0, L_1 = 1)$, $(A_0 = 1, L_1 = 0)$, $(A_0 = 1, L_1 = 1)$, by comparing the mean
outcome between rows 3 and 4, rows 5 and 6, and rows 7 and 8, respectively.

We can now show that the average causal effect of $A_0$ is also zero. To do so,
we need to compute the associational difference $E[Y \mid A_0 = 1] - E[Y \mid A_0 = 0]$
which, because of randomization, is a valid estimator of the causal contrast
$E[Y^{a_0=1}] - E[Y^{a_0=0}]$. The mean outcome $E[Y \mid A_0 = 0]$ among the 16,000
individuals untreated at time 0 is the weighted average of the mean outcomes in
rows 1, 2, 3 and 4, which is 60. And $E[Y \mid A_0 = 1]$, computed analogously, is
also 60. Therefore, the average causal effect of $A_0$ is zero.

The weighted average is
$\frac{2400}{16000} \times 84 + \frac{1600}{16000} \times 84 + \frac{2400}{16000} \times 52 + \frac{9600}{16000} \times 52 = 60$
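
The separate checks for $A_1$ and $A_0$ described above can be reproduced with a few lines of Python. This is a minimal sketch that ignores random variability, as the text does; the helper `mean_y` and the variable names are hypothetical and chosen only for illustration.

```python
# Rows of Table 20.1 as (N, A0, L1, A1, mean Y). Names are illustrative only.
TABLE_20_1 = [
    (2400, 0, 0, 0, 84), (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52), (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76), (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44), (6400, 1, 1, 1, 44),
]


def mean_y(**fixed):
    """Weighted mean of Y among rows whose a0, l1, a1 values match the given ones."""
    numerator = denominator = 0
    for n, a0, l1, a1, y in TABLE_20_1:
        row = {"a0": a0, "l1": l1, "a1": a1}
        if all(row[name] == value for name, value in fixed.items()):
            numerator += n * y
            denominator += n
    return numerator / denominator


# Effect of A1 within each stratum of (A0, L1): rows 1-2, 3-4, 5-6, 7-8.
a1_effects = {
    (a0, l1): mean_y(a0=a0, l1=l1, a1=1) - mean_y(a0=a0, l1=l1, a1=0)
    for a0 in (0, 1)
    for l1 in (0, 1)
}

# Effect of A0, which is marginally randomized.
mean_untreated_at_0 = mean_y(a0=0)
mean_treated_at_0 = mean_y(a0=1)
a0_effect = mean_treated_at_0 - mean_untreated_at_0
```

Every entry of `a1_effects` and `a0_effect` evaluates to zero for these data, and both marginal means equal 60, in agreement with the calculations in the text.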


We have confirmed that the causal effects of $A_0$ and $A_1$ (conditional on
the past) are zero when we treat $A_0$ and $A_1$ separately as time-fixed
treatments. What if we now treat the joint treatment $(A_0, A_1)$ as a time-varying
treatment and compare two treatment strategies? For example, let us say that
we want to compare the strategies “always treat” versus “never treat”, that is
$(a_0 = 1, a_1 = 1)$ versus $(a_0 = 0, a_1 = 0)$. Because the identifiability conditions
hold, the data in Table 20.1 should suffice to validly estimate this effect.

Because the effect for each of the individuals components of the strategy, ao 
and a1, is zero, it follows from the g-null theorem that the average causal effect 
E [Yerba=l)] — E [y=] is zero. But is this what we conclude from 
the data if we use conventional analytic methods? To answer this question, 
let us conduct two data analyses. In the first one, we do not adjust for the 
confounder Lı, which should give us an incorrect effect estimate. In the second 
one, we do adjust for the confounder Lı via stratification. 


1. We compare the mean outcome in the 9600 individuals who were treated
at both times (rows 6 and 8 of Table 20.1) with that in the 4800 individuals
who were untreated at both times (rows 1 and 3). The respective
averages are $E[Y \mid A_0 = 1, A_1 = 1] = 54.7$ and $E[Y \mid A_0 = 0, A_1 = 0] = 68$,
computed as

$$E[Y \mid A_0 = 1, A_1 = 1] = \frac{3200}{9600} \times 76 + \frac{6400}{9600} \times 44 = 54.7$$

$$E[Y \mid A_0 = 0, A_1 = 0] = \frac{2400}{4800} \times 84 + \frac{2400}{4800} \times 52 = 68.0$$

The associational difference is $54.7 - 68 = -13.3$ which, if interpreted
causally, would mean that not being treated at either time is better than
being treated at both times. This analysis gives the wrong answer (a
non-null difference) because $E[Y \mid A_0 = a_0, A_1 = a_1]$ is not a valid
estimator of $E[Y^{a_0, a_1}]$. Adjustment for the confounder $L_1$ is needed.

2. We adjust for $L_1$ via stratification. That is, we compare the mean
outcome in individuals who were treated at both times with that in individuals
who were untreated at both times, within levels of $L_1$. For example, take
the stratum $L_1 = 0$. The mean outcome in the treated at both times,
$E[Y \mid A_0 = 1, L_1 = 0, A_1 = 1]$, is 76 (row 6). The mean outcome in the
untreated at both times, $E[Y \mid A_0 = 0, L_1 = 0, A_1 = 0]$, is 84 (row 1). The
associational difference is $76 - 84 = -8$ which, if interpreted causally,
would mean that, in the stratum $L_1 = 0$, not being treated at either
time is better than being treated at both times. Similarly, the difference
$E[Y \mid A_0 = 1, L_1 = 1, A_1 = 1] - E[Y \mid A_0 = 0, L_1 = 1, A_1 = 0]$ in the
stratum $L_1 = 1$ is also $-8$. Note that, because the effect is $-8$ in both
strata of $L_1$, it is not possible that a weighted average of the
stratum-specific effects will yield the correct value 0.














What? We said that the effect estimate should be 0, not $-8$. How is
it possible that the analysis adjusted for the confounder also gives a wrong 
answer? This estimate reflects the bias of traditional methods to adjust for 
confounding when there is treatment-confounder feedback. The next section 
explains why the bias arises. 
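
Both conventional analyses can be reproduced directly from the table. The sketch below is illustrative only: it hard-codes the rows of Table 20.1, and the helper name `weighted_mean` is an assumption, not something defined in the text.

```python
# Rows of Table 20.1: (N, A0, L1, A1, mean Y)
TABLE = [
    (2400, 0, 0, 0, 84),
    (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52),
    (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76),
    (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44),
    (6400, 1, 1, 1, 44),
]


def weighted_mean(rows):
    """Mean outcome over the rows, weighting each row by its count N."""
    return sum(r[0] * r[4] for r in rows) / sum(r[0] for r in rows)


always = [r for r in TABLE if r[1] == 1 and r[3] == 1]  # treated at both times
never = [r for r in TABLE if r[1] == 0 and r[3] == 0]   # untreated at both times

# Analysis 1: no adjustment for L1
unadjusted = weighted_mean(always) - weighted_mean(never)

# Analysis 2: the same contrast within each stratum of L1
stratified = {
    l1: weighted_mean([r for r in always if r[2] == l1])
        - weighted_mean([r for r in never if r[2] == l1])
    for l1 in (0, 1)
}

print(round(unadjusted, 1))
print(stratified)
```

Neither analysis returns the null value: the unadjusted contrast is about $-13.3$ and both stratum-specific contrasts equal $-8$.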


20.3 Why traditional methods fail 


[Figure 20.5: causal diagram with nodes $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]


Table 20.1 shows data from a sequentially randomized trial with
treatment-confounder feedback, as represented by the causal diagram in Figure 20.3. Even
though no data on the unmeasured variable $U_1$ (immunosuppression level) is
available, all three identifiability conditions hold: $U_1$ is not needed if we have
data on the confounder $L_1$. Therefore, as discussed in Chapter 19, we should
be able to correctly estimate causal effects involving any static or dynamic
treatment strategies. And yet our analyses in the previous section did not
yield the correct answer, whether or not we adjusted for $L_1$.

The problem was that we did not use the correct method to adjust for confounding.
Stratification is a commonly used method to adjust for confounding,
but it cannot handle treatment-confounder feedback. Stratification means estimating
the association between treatment and outcome in subsets (strata) of
the study population defined by the confounders, $L_1$ in our example. Because
the variable $L_1$ can take only two values (1 if the CD4 cell count is low, and
0 otherwise), there are two such strata in our example. To estimate the causal
effect in those with $L_1 = l$, we selected (i.e., conditioned or stratified on) the
subset of the population with value $L_1 = l$.

Fine Point 20.2


Confounders on the causal pathway. Conditioning on confounders $L_1$ which are affected by previous treatment can
create selection bias even if the confounder is not on a causal pathway between treatment and outcome. In fact, no
such causal pathway exists in Figures 20.5 and 20.6.

On the other hand, in Figure 20.7 the confounder $L_1$ for subsequent treatment $A_1$ lies on a causal pathway from
earlier treatment $A_0$ to outcome $Y$, i.e., the path $A_0 \rightarrow L_1 \rightarrow Y$. Were the potential for selection bias not present
in Figure 20.7 (e.g., were $U_1$ not a common cause of $L_1$ and $Y$), the associational differences within strata of $L_1$ could
be an unbiased estimate of the direct effect of $A_0$ on $Y$ not through $L_1$, but still would not be an unbiased estimate of
the overall effect of $\bar{A}$ on $Y$, because the effect of $A_0$ mediated through $L_1$ is not included.

It is sometimes said that variables on a causal pathway between treatment and outcome cannot be considered as
confounders, because adjusting for those variables will result in a biased effect estimate. However, this characterization
of confounders is inaccurate for time-varying treatments. Figure 20.7 shows that a confounder for subsequent treatment
$A_1$ can be on a causal pathway between past treatment $A_0$ and the outcome. As for whether adjustment for confounders
on a causal pathway induces bias for the effect of a treatment strategy, that depends on the choice of adjustment method.
Stratification will indeed induce bias; g-methods will not.


But stratification can have unintended effects when the association measure
is computed within levels of a variable $L_1$ that is caused by prior treatment $A_0$.
Indeed Figure 20.5 shows that conditioning on $L_1$ (a collider) opens the path
$A_0 \rightarrow L_1 \leftarrow U_1 \rightarrow Y$. That is, stratification induces a noncausal association
between the treatment $A_0$ at time 0 and the unmeasured variable $U_1$, and
therefore between $A_0$ and the outcome $Y$, within levels of $L_1$. Among those
with low CD4 count ($L_1 = 1$), being on treatment ($A_0 = 1$) becomes a marker
for severe immunosuppression (high value of $U_1$); among those with a high level
of CD4 ($L_1 = 0$), being off treatment ($A_0 = 0$) becomes a marker for milder
immunosuppression (low value of $U_1$). Thus, the side effect of stratification is
to induce an association between treatment $A_0$ and outcome $Y$.
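
The collider mechanism can be illustrated with a small exact calculation. The sketch below is not taken from the book: every probability and mean in it is a hypothetical value chosen only so that $A_0$ and $U_1$ both affect $L_1$, $U_1$ affects $Y$, and $A_0$ has no effect on $Y$.

```python
P_U1 = 0.5  # hypothetical Pr[U1 = 1] (severe immunosuppression)


def p_l1(a0, u1):
    """Hypothetical Pr[L1 = 1 | A0, U1]: low CD4 is likelier if U1 = 1, less likely if treated."""
    return 0.3 + 0.5 * u1 - 0.2 * a0


def mean_y_given_u1(u1):
    """Hypothetical mean outcome: depends on U1 only, so A0 has no causal effect on Y."""
    return 80 - 30 * u1


def cond_mean_y(a0, l1=None):
    """E[Y | A0 = a0], or E[Y | A0 = a0, L1 = l1] if l1 is given, by exact enumeration over U1."""
    num = den = 0.0
    for u1 in (0, 1):
        weight = P_U1 if u1 == 1 else 1 - P_U1  # U1 is independent of randomized A0
        if l1 is not None:
            p1 = p_l1(a0, u1)
            weight *= p1 if l1 == 1 else 1 - p1
        num += weight * mean_y_given_u1(u1)
        den += weight
    return num / den


marginal = cond_mean_y(1) - cond_mean_y(0)
within_l1 = {l1: cond_mean_y(1, l1) - cond_mean_y(0, l1) for l1 in (0, 1)}

print(marginal)
print(within_l1)
```

Under these assumed numbers the marginal contrast is exactly zero, while the contrast within each level of $L_1$ is negative: conditioning on the collider makes treatment look harmful although it has no effect.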

In other words, stratification eliminates confounding for $A_1$ at the cost of
introducing selection bias for $A_0$. The associational differences

$$E[Y \mid A_0 = 1, L_1 = l, A_1 = 1] - E[Y \mid A_0 = 0, L_1 = l, A_1 = 0]$$

may be different from 0 even if, as in our example, treatment has no effect on
the outcome of any individual at any time. This bias arises from choosing
a subset of the study population by selecting on a variable $L_1$ affected by (a
component $A_0$ of) the time-varying treatment. The net bias depends on the
relative magnitude of the confounding that is eliminated and the selection bias
that is created.

[Figure 20.6: causal diagram with nodes $W_0$, $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]

Technically speaking, the bias of traditional methods will occur not only
when the confounders are affected by prior treatment (in randomized experiments
or observational studies), but also when the confounders share an unmeasured
cause $W$ with prior treatment (in observational studies). In the
observational study depicted in Figure 20.6, conditioning on the collider $L_1$
opens the path $A_0 \leftarrow W_0 \rightarrow L_1 \leftarrow U_1 \rightarrow Y$. For this reason, we referred
to both settings in Figures 20.3 and 20.4 (which cannot be distinguished using
the observed data) as examples of treatment-confounder feedback.

[Figure 20.7: causal diagram with nodes $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]

The causal diagrams that we have considered to describe the bias of traditional
methods are all very simple. They only represent settings in which
treatment does not have a causal effect on the outcome. However, conditioning
on a confounder in the presence of treatment-confounder feedback also induces
bias when treatment has a non-null effect, as in Figure 20.7. The presence of
arrows from $A_0$, $A_1$, or $L_1$ to $Y$ does not change the fact that conditioning
on $L_1$ creates an association between $A_0$ and $Y$ that does not have a causal
interpretation (see also Fine Point 20.2). Also, our causal diagrams had only
two time points and a limited number of nodes, but the bias of traditional
methods will also arise from high-dimensional data with multiple time points
and variables. In fact, the presence of time-varying confounders affected by
previous treatment at multiple times increases the possibility of a large bias.

In general, valid estimation of the effect of treatment strategies is only
possible when the joint effect of the treatment components $A_k$ can be estimated
simultaneously and without bias. As we have just seen, this may be impossible
to achieve using stratification, even when data on all time-varying confounders
are available.


20.4 Why traditional methods cannot be fixed 


The number of data combinations
is even greater because there are
multiple confounders $L_k$ measured
at each time point $k$.


We showed that stratification cannot be used as a confounding adjustment 
method when there is treatment-confounder feedback. But what about other 
traditional methods? For example, we could have used parametric outcome 
regression, rather than nonparametric stratification, to adjust for confounding. 
Would outcome regression succeed where plain stratification failed? 

This question is particularly important for settings with high-dimensional
data, because in high-dimensional settings we will be unable to conduct a
simple stratified analysis like we did in the previous section. In Table 20.1,
treatment $A_k$ occurs at two months $k = 0, 1$, which means that there are
only $2^2$ static treatment strategies $\bar{a}$. But when the treatment $A_k$ occurs at
multiple points $k = 0, 1, \ldots, K$, we will not be able to present a table with all the
combinations of treatment values. If, as is not infrequent in practice, $K$ is of
the order of 100, then there are $2^{100}$ static treatment strategies $\bar{a}$, a staggering
number that far exceeds the sample size of any study. The total number of
treatment strategies is much greater when we consider dynamic strategies as
well.
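
To make the growth concrete, the short sketch below enumerates the static strategies for two time points and counts them for 100 time points. The function name `static_strategies` is an illustrative assumption.

```python
from itertools import product


def static_strategies(n_times):
    """All deterministic static strategies for a binary treatment at n_times time points."""
    return list(product((0, 1), repeat=n_times))


two_point = static_strategies(2)   # the strategies available in Table 20.1
count_100 = 2 ** 100               # number of static strategies with 100 time points

print(two_point)
print(count_100)
```

With two time points the list contains only “never treat”, the two single-time strategies, and “always treat”; with 100 time points the count has 31 digits, so enumeration (let alone stratification) is not feasible.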

As we have been arguing since Chapter 11, we will need modeling to estimate
average causal effects involving $E[Y^{\bar{a}}]$ when there are many possible
treatment strategies $\bar{a}$. To do so, we will need to hypothesize a dose-response
function for the effect of treatment history $\bar{a}$ on the mean outcome $Y$. One
possibility would be to assume that the effect of treatment strategies $\bar{a}$ increases
linearly as a function of the cumulative treatment under each strategy.
Under this assumption, all strategies that assign treatment for exactly three
months have the same effect, regardless of the period when those three months
of treatment occur, e.g., the first 3 months of follow-up, the last 3 months of
follow-up, etc. The price paid for modeling is yet another threat to the validity
of our estimates due to possible misspecification of the dose-response
function.

Unfortunately, regression modeling does not remove the bias of traditional
methods in the presence of treatment-confounder feedback, as we now show.
For the data in Table 20.1, let us define cumulative treatment $cum(\bar{A}) = A_0 + A_1$,
which can take 3 values: 0 (if the individual remains untreated at
both times), 1 (if the individual is treated at time 0 only or at time 1 only),
and 2 (if the individual is treated at both times). The treatment strategies of
interest can then be expressed as “always treat” $cum(\bar{a}) = 2$, and “never treat”
$cum(\bar{a}) = 0$, and the average causal effect as $E[Y^{cum(\bar{a}) = 2}] - E[Y^{cum(\bar{a}) = 0}]$.
Again, any valid method should estimate that the value of this difference is 0.

Under the assumption that the mean outcome $E[Y \mid \bar{A}, L_1]$ depends linearly
on the covariate $cum(\bar{A})$, we could fit the outcome regression model

$$E[Y \mid \bar{A}, L_1] = \theta_0 + \theta_1 cum(\bar{A}) + \theta_2 L_1$$

The associational difference $E[Y \mid cum(\bar{A}) = 2, L_1] - E[Y \mid cum(\bar{A}) = 0, L_1]$ is
equal to $\theta_1 \times 2$. (The model correctly assumes that the difference is the same in
the strata $L_1 = 1$ and $L_1 = 0$.) Therefore some might want to interpret $\theta_1 \times 2$
as the average causal effect of “always treat” versus “never treat” within levels
of the covariate $L_1$. But such causal interpretation is unwarranted because,
as Figure 20.5 shows, conditioning on $L_1$ induces an association between $A_0$,
a component of treatment $cum(\bar{A})$, and the outcome $Y$. This implies that
$\theta_1$ (and therefore the associational difference of means) is non-zero even if
the true causal effect is zero. A similar argument can be applied to matching.
G-methods are needed to appropriately adjust for time-varying confounders in
the presence of treatment-confounder feedback.

We invite readers to check for
themselves that $\theta_1$ is not zero by fitting
this outcome regression model
to the data in Table 20.1.
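
One way to carry out that check is sketched below. It is a minimal illustration, not the authors' code: it fits the model to the eight cell means of Table 20.1 by weighted least squares with the cell counts as weights (an assumption that reproduces the individual-level coefficients because all individuals in a cell share the same covariates), and it solves the normal equations with a small hand-written solver, `solve`, to stay dependency-free.

```python
# Rows of Table 20.1: (N, A0, L1, A1, mean Y)
TABLE = [
    (2400, 0, 0, 0, 84),
    (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52),
    (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76),
    (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44),
    (6400, 1, 1, 1, 44),
]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Design row per cell: intercept, cum(A) = A0 + A1, L1
design = [(1.0, float(a0 + a1), float(l1)) for _, a0, l1, a1, _ in TABLE]
weights = [float(r[0]) for r in TABLE]
outcome = [float(r[4]) for r in TABLE]

# Weighted normal equations: (X'WX) theta = X'Wy
xtwx = [[sum(w * x[i] * x[j] for w, x in zip(weights, design)) for j in range(3)]
        for i in range(3)]
xtwy = [sum(w * x[i] * y for w, x, y in zip(weights, design, outcome)) for i in range(3)]

theta0, theta1, theta2 = solve(xtwx, xtwy)
print(round(theta1, 2), round(2 * theta1, 2))
```

The fitted $\theta_1$ is about $-4.4$, so the model-based “always treat” versus “never treat” contrast $\theta_1 \times 2$ is about $-8.8$ rather than the true value 0.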


20.5 Adjusting for past treatment 


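[Figure 20.8: causal diagram with nodes $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]

[Figure 20.9: causal diagram with nodes $W_0$, $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]

[Figure 20.10: causal diagram with nodes $A_0$, $L_1$, $A_1$, $U_1$, and $Y$]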


One more thing before we discuss g-methods. For simplicity, we have so far
described treatment-confounder feedback under simplified causal diagrams in
which past treatment does not directly affect subsequent treatment. That is,
the causal diagrams in Figures 20.3 and 20.4 did not include an arrow from $A_0$
to $A_1$. We now consider the more general case in which past treatment may
directly affect subsequent treatment.

As an example, suppose doctors in our HIV study use information on past
treatment history $\bar{A}_{k-1}$ when making a decision about whether to prescribe
treatment $A_k$ at time $k$. To represent this situation, we add an arrow from $A_0$
to $A_1$ to the causal diagrams in Figures 20.3 and 20.4, as depicted in Figures
20.8 and 20.9.

The causal diagrams in Figures 20.8 and 20.9 show that, in the presence of
treatment-confounder feedback, conditioning on $L_1$ is insufficient to block all
backdoor paths between treatment $A_1$ and outcome $Y$. Indeed conditioning
on $L_1$ opens the path $A_1 \leftarrow A_0 \rightarrow L_1 \leftarrow U_1 \rightarrow Y$ in Figure 20.8, and the
path $A_1 \leftarrow A_0 \leftarrow W_0 \rightarrow L_1 \leftarrow U_1 \rightarrow Y$ in Figure 20.9. Of course, regardless
of whether treatment-confounder feedback exists, conditioning on past treatment
history is always required when past treatment has a non-null effect on
the outcome, as in the causal diagram of Figure 20.10. Under this diagram,
treatment $A_0$ is a confounder of the effect of treatment $A_1$.

Therefore, sequential exchangeability at time $k$ generally requires conditioning
on treatment history $\bar{A}_{k-1}$ before $k$; conditioning only on the covariates $L$
is not enough. That is why, in this and in the previous chapter, all the conditional
independence statements representing sequential exchangeability were
conditional on treatment history.

Past treatment plays an important role in the estimation of effects of
time-fixed treatments too. Suppose we are interested in estimating the effect of
the time-fixed treatment $A_1$ (as opposed to the effect of a treatment strategy
involving both $A_0$ and $A_1$) on $Y$. (Sometimes the effect of $A_1$ is referred
to as the short-term effect of the time-varying treatment $\bar{A}$.) Then
lack of adjustment for past treatment $A_0$ will generally result in selection
bias if there is treatment-confounder feedback, and in confounding if past
treatment $A_0$ directly affects the outcome $Y$. In other words, the difference
$E[Y \mid A_1 = 1, L_1] - E[Y \mid A_1 = 0, L_1]$ would not be zero even if treatment $A_1$ had
no effect on any individual’s outcome $Y$, as in Figures 20.8-20.10. In practice,
when making causal inferences about time-fixed treatments, bias may arise in
analyses that compare current users ($A_1 = 1$) versus nonusers ($A_1 = 0$) of
treatment. To avoid the bias, one can adjust for prior treatment history or
restrict the analysis to individuals with a particular treatment history. This
is the idea behind “new-user designs” for time-fixed treatments: restrict the
analysis to individuals who had not used treatment in the past.

The requirement to adjust for past treatment has additional bias implications
when past treatment is mismeasured. As discussed in Section 9.3, a
mismeasured confounder may result in effect estimates that are biased, either
upwards or downwards. In our HIV example, suppose investigators did not
have access to the study participants’ medical records. Rather, to ascertain
prior treatment, investigators had to ask participants via a questionnaire. Since
not all participants provided an accurate recollection of their treatment history,
treatment $A_0$ was measured with error. Investigators had data on the
mismeasured variable $A_0^*$ rather than on the variable $A_0$. To depict this setting
in Figures 20.8-20.10, we add an arrow from the true treatment $A_0$ to
the mismeasured treatment $A_0^*$, which shows that conditioning on $A_0^*$ cannot
block the biasing paths between $A_1$ and $Y$ that go through $A_0$. Investigators
will then conclude that there is an association between $A_1$ and $Y$, even after
adjusting for $A_0^*$ and $L_1$, despite the lack of an effect of $A_1$ on $Y$. Therefore,
when treatment is time-varying, we find that, contrary to a widespread belief,
mismeasurement of treatment (even if the measurement error is independent
and non-differential) may cause bias under the null. This bias arises because
past treatment is a confounder for the effect of subsequent treatment, even
if past treatment has no causal effect on the outcome. Furthermore, under
the alternative, this imperfect bias adjustment may result in an exaggerated
estimate of the effect.

Chapter 21
G-METHODS FOR TIME-VARYING TREATMENTS


In the previous chapter we described a dataset with a time-varying treatment and treatment-confounder feedback. 
We showed that, when applied to this dataset, traditional methods for confounding adjustment could not correctly 
adjust for confounding. Even though the time-varying treatment had a zero causal effect on the outcome, traditional 
adjustment methods yielded effect estimates that were different from the null. 

This chapter describes the solution to the bias of traditional methods in the presence of treatment-confounder 
feedback: the use of g-methods—the g-formula, IP weighting, g-estimation, and their doubly-robust generalizations. 
Using the same dataset as in the previous chapter, here we show that the three g-methods yield the correct (null) 
effect estimate. For time-fixed treatments, we described the g-formula in Chapter 13, IP weighting of marginal 
structural models in Chapter 12, and g-estimation of structural nested models in Chapter 15. Here we introduce 
each of the three g-methods for the comparison of static treatment strategies under the identifiability conditions 
described in Chapter 19: sequential exchangeability, positivity, and consistency. 


21.1 The g-formula for time-varying treatments 


Consider again the data from the sequentially randomized experiment in Table
20.1 which, for convenience, we reproduce here as Table 21.1. Suppose
we are only interested in the effect of the time-fixed treatment $A_1$. That is,
suppose we want to contrast the mean counterfactual outcomes $E[Y^{a_1 = 1}]$ and
$E[Y^{a_1 = 0}]$. In Parts I and II we have shown that, under the identifiability
conditions, each of the means $E[Y^{a_1}]$ is a weighted average of the mean
outcome $E[Y \mid A_1 = a_1, L_1 = l_1]$ conditional on the (time-fixed) treatment and
confounders. Specifically, $E[Y^{a_1}]$ equals the weighted average

$$\sum_{l_1} E[Y \mid A_1 = a_1, L_1 = l_1] \, f(l_1), \quad \text{where } f(l_1) = \Pr[L_1 = l_1].$$

This weighted average is the g-formula. Under conditional exchangeability
given the time-fixed confounders $L_1$, the g-formula is the mean outcome
standardized to the distribution of the confounders in the study population.

Table 21.1

N      A0   L1   A1   Mean Y
2400   0    0    0    84
1600   0    0    1    84
2400   0    1    0    52
9600   0    1    1    52
4800   1    0    0    76
3200   1    0    1    76
1600   1    1    0    44
6400   1    1    1    44

But, in the sequentially randomized experiment of Table 21.1, the treatment
$\bar{A} = (A_0, A_1)$ is time-varying and, as we saw in the previous chapter,
there is treatment-confounder feedback. That means that traditional adjustment
methods cannot be relied on to unbiasedly estimate the causal effect of
time-varying treatment $\bar{A}$. For example, traditional methods may not provide
valid estimates of the mean outcome under “always treat” $E[Y^{a_0 = 1, a_1 = 1}]$ and
the mean outcome under “never treat” $E[Y^{a_0 = 0, a_1 = 0}]$ even in a sequentially
randomized experiment in which sequential exchangeability holds. In contrast,
the g-formula can be used to calculate the counterfactual means $E[Y^{a_0, a_1}]$ in
a sequentially randomized experiment. To do so, the above expression of the
g-formula for time-fixed treatments needs to be generalized.

The g-formula for $E[Y^{a_0, a_1}]$ under the identifiability conditions (described
in Chapter 19) will still be a weighted average, but now it will be a weighted
average of the mean outcome $E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1]$ conditional on
the time-varying treatment and confounders required to achieve sequential
exchangeability. Specifically, the g-formula

$$\sum_{l_1} E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1] \, f(l_1 \mid a_0)$$

equals $E[Y^{a_0, a_1}]$ under (static) sequential exchangeability for $Y^{a_0, a_1}$. That is,
for a time-varying treatment, the g-formula estimator of the counterfactual
mean outcome under the identifiability conditions is the mean outcome
standardized to the distribution of the confounders in the study population, with
every factor in the expression conditional on past treatment and covariate history.
This conditioning on prior history is not necessary in the time-fixed case
in which both treatment and confounders are measured at a single time point.

In a study with 2 time points, the
g-formula for “never treat” is

$$E[Y \mid A_0 = 0, A_1 = 0, L_1 = 0] \times \Pr[L_1 = 0 \mid A_0 = 0] + E[Y \mid A_0 = 0, A_1 = 0, L_1 = 1] \times \Pr[L_1 = 1 \mid A_0 = 0]$$

[Figure 21.1: causally structured tree graph representing the data in Table 21.1]

Note that the g-formula is only computable (i.e., well-defined) if, for any
value $l_1$ such that $f(l_1 \mid a_0) \neq 0$, there are individuals in the population with
$(A_0 = a_0, A_1 = a_1, L_1 = l_1)$. This is equivalent to the definition of positivity
given in Technical Point 19.2 and a generalization for time-varying treatments
of the discussion of positivity in Technical Point 3.1.

Let us apply the g-formula to estimate the causal effect
$E[Y^{a_0 = 1, a_1 = 1}] - E[Y^{a_0 = 0, a_1 = 0}]$ from the sequentially randomized experiment of Table 21.1.
The g-formula estimate for the mean $E[Y^{a_0 = 0, a_1 = 0}]$ is $84 \times 0.25 + 52 \times 0.75 = 60$.
The g-formula estimate for the mean $E[Y^{a_0 = 1, a_1 = 1}]$ is $76 \times 0.50 + 44 \times 0.50 = 60$.
Therefore the estimate of the causal effect $E[Y^{a_0 = 1, a_1 = 1}] - E[Y^{a_0 = 0, a_1 = 0}]$ is
0, as expected. The g-formula succeeds where traditional methods failed.
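
The calculation can be written out as a short function. This is a minimal sketch of the nonparametric g-formula for two time points, assuming the rows of Table 21.1; the function name `g_formula` is illustrative and not taken from any software mentioned in the text.

```python
# Rows of Table 21.1: (N, A0, L1, A1, mean Y)
TABLE = [
    (2400, 0, 0, 0, 84),
    (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52),
    (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76),
    (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44),
    (6400, 1, 1, 1, 44),
]


def g_formula(a0, a1):
    """Sum over l1 of E[Y | A0=a0, A1=a1, L1=l1] * Pr[L1=l1 | A0=a0]."""
    n_a0 = sum(r[0] for r in TABLE if r[1] == a0)
    total = 0.0
    for l1 in (0, 1):
        # Pr[L1 = l1 | A0 = a0], estimated by the observed proportion
        p_l1 = sum(r[0] for r in TABLE if r[1] == a0 and r[2] == l1) / n_a0
        # E[Y | A0 = a0, A1 = a1, L1 = l1], read off the single matching row
        (mean_y,) = [r[4] for r in TABLE if r[1:4] == (a0, l1, a1)]
        total += mean_y * p_l1
    return total


always_treat = g_formula(1, 1)
never_treat = g_formula(0, 0)
print(always_treat, never_treat, always_treat - never_treat)
```

Both counterfactual means evaluate to 60, so the estimated effect of “always treat” versus “never treat” is 0.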


Leaves of the tree in Figure 21.1 (one per row of Table 21.1):

N      Mean Y
2400   84
1600   84
2400   52
9600   52
4800   76
3200   76
1600   44
6400   44

Another way to think of the g-formula is as a simulation. Under sequential
exchangeability for $Y$ and $L$ jointly, the g-formula simulates the counterfactual
outcome $Y^{\bar{a}}$ and covariate history $\bar{L}^{\bar{a}}$ that would have been observed if
everybody in the study population had followed treatment strategy $\bar{a}$. In other
words, the g-formula simulates (identifies) the joint distribution of the
counterfactuals $(Y^{\bar{a}}, \bar{L}^{\bar{a}})$ under strategy $\bar{a}$. To see this, first consider the
causally-structured tree graph in Figure 21.1, which is an alternative representation of
the data in Table 21.1. Under the aforementioned identifiability condition, the
g-formula can be viewed as a procedure to build a new tree in which all individuals
follow strategy $\bar{a}$. For example, the causally-structured tree graph in
Figure 21.2 shows the counterfactual population that would have been observed
if all individuals had followed the strategy “always treat” ($a_0 = 1, a_1 = 1$).

[Figure 21.2: causally structured tree graph of the counterfactual population under “always treat”, with two leaves: N = 16,000 with mean outcome 76, and N = 16,000 with mean outcome 44]

Under sequential exchangeability,
$\Pr[L_1 = l_1 \mid A_0 = a_0] = \Pr[L_1^{a_0} = l_1]$
and
$E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1] = E[Y^{a_0, a_1} \mid L_1^{a_0} = l_1]$.
Thus the g-formula is
$\sum_{l_1} E[Y^{a_0, a_1} \mid L_1^{a_0} = l_1] \Pr[L_1^{a_0} = l_1]$, which equals
$E[Y^{a_0, a_1}]$ as required.

Under any of the causal diagrams
shown in this book, the g-formula
that includes all the unmeasured
variables (such as $U$ and $W$) is always
correct. Unfortunately, the
unmeasured variables are by definition
unavailable to the investigators.

The g-formula for time-varying
treatments was first described by
Robins (1986, 1987).

To simulate this counterfactual population we (i) assign probability 1 to
receiving treatment $a_0 = 1$ and $a_1 = 1$ at times $k = 0$ and $k = 1$, respectively,
and (ii) assign the same probability $\Pr[L_1 = l_1 \mid A_0 = a_0]$ and the same mean
$E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1]$ as in the original study population.
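
Steps (i) and (ii) can be sketched as a function that rebuilds the tree for a chosen static strategy. The code is an illustration under the assumption that the data are the rows of Table 21.1; the name `counterfactual_tree` is hypothetical.

```python
# Rows of Table 21.1: (N, A0, L1, A1, mean Y)
TABLE = [
    (2400, 0, 0, 0, 84),
    (1600, 0, 0, 1, 84),
    (2400, 0, 1, 0, 52),
    (9600, 0, 1, 1, 52),
    (4800, 1, 0, 0, 76),
    (3200, 1, 0, 1, 76),
    (1600, 1, 1, 0, 44),
    (6400, 1, 1, 1, 44),
]


def counterfactual_tree(a0, a1):
    """Leaves (l1, N, mean Y) of the tree in which everyone follows strategy (a0, a1)."""
    total_n = sum(r[0] for r in TABLE)
    n_a0 = sum(r[0] for r in TABLE if r[1] == a0)
    leaves = []
    for l1 in (0, 1):
        # (ii) keep Pr[L1 = l1 | A0 = a0] as in the original population ...
        p_l1 = sum(r[0] for r in TABLE if r[1] == a0 and r[2] == l1) / n_a0
        # ... and keep E[Y | A0 = a0, A1 = a1, L1 = l1] as in the original population
        (mean_y,) = [r[4] for r in TABLE if r[1:4] == (a0, l1, a1)]
        # (i) everyone receives a0 and then a1, so the whole population flows down this branch
        leaves.append((l1, total_n * p_l1, mean_y))
    return leaves


always = counterfactual_tree(1, 1)
mean_always = sum(n * y for _, n, y in always) / sum(n for _, n, _ in always)
print(always, mean_always)
```

For “always treat” the function returns two leaves of 16,000 individuals each, with mean outcomes 76 and 44, and the overall mean of the simulated population is 60.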

Two important points. First, the value of the g-formula depends on what,
if anything, has been included in $L$. As an example, suppose we do not collect
data on $L_1$ because we believe, incorrectly, that our study is represented by
a causal diagram like the one in Figure 20.8 after removing the arrow from
$L_1$ to $A_1$. Thus we believe $L_1$ is not a confounder and hence not necessary
for identification. Then the g-formula in the absence of data on $L_1$ becomes
$E[Y \mid A_0 = a_0, A_1 = a_1]$ because there is no covariate history to adjust for. However,
because our study is actually represented by the causal graph in Figure
20.8 (under which treatment assignment $A_1$ is affected by $L_1$), the g-formula
that fails to include $L_1$ no longer has a causal interpretation.

Second, even when the g-formula has a causal interpretation, each of its
components may lack a causal interpretation. As an example, consider the
causal diagram in Figure 20.9 under which only static sequential exchangeability
holds. The g-formula that includes $L_1$ correctly identifies the mean of $Y^{\bar{a}}$.
Remarkably, regardless of whether we add arrows from $A_0$ and $A_1$ to $Y$, the
g-formula continues to have a causal interpretation as $E[Y^{\bar{a}}]$, even though neither
of its components, $E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1]$ and $\Pr[L_1 = l_1 \mid A_0 = a_0]$,
has any causal interpretation at all. That is,
$\Pr[L_1 = l_1 \mid A_0 = a_0] \neq \Pr[L_1^{a_0} = l_1]$
and $E[Y \mid A_0 = a_0, A_1 = a_1, L_1 = l_1] \neq E[Y^{a_0, a_1} \mid L_1^{a_0} = l_1]$. The last two
inequalities will be equalities in a sequentially randomized trial like the one
represented in Figures 20.1 and 20.2.

Now let us generalize the g-formula to high-dimensional settings with multiple
times $k$. The g-formula is

$$\sum_{\bar{l}} E[Y \mid \bar{A} = \bar{a}, \bar{L} = \bar{l}] \prod_{k=0}^{K} f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1}),$$

where the sum is over all possible $\bar{l}$-histories ($\bar{l}_{k-1}$ is the history through time
$k-1$). Under sequential exchangeability for $Y^{\bar{a}}$ given $(\bar{L}_k, \bar{A}_{k-1})$ at each time $k$,
this expression equals the counterfactual mean $E[Y^{\bar{a}}]$ under treatment strategy
$\bar{a}$. Fine Point 21.1 presents a more nuanced definition of the term “history”.
Technical Point 21.1 shows a more general expression for the g-formula, which
can be used to compute densities, not just means.


Fine Point 21.1


Treatment and covariate history. When describing g-methods, we often refer to the treatment and covariate history
that is required to achieve sequential exchangeability. For the g-formula, we say that its components are conditional on
prior treatment and covariate history. For example, the factor corresponding to the probability of a discrete confounder
$L_2$ at time $k = 2$

$$f(l_2 \mid \bar{A}_1 = \bar{a}_1, \bar{L}_1 = \bar{l}_1) = \Pr[L_2 = l_2 \mid A_0 = a_0, A_1 = a_1, L_0 = l_0, L_1 = l_1]$$

is conditional on treatment and confounders at prior times 0 and 1; the factor at time $k = 3$ is conditional on treatment
and confounders at times 0, 1, and 2, and so on.

However, the term “history” is not totally accurate because, as explained in Fine Point 7.2, confounders can
theoretically be in the future of treatment. Conversely, as explained along with Figure 7.4, adjusting for some variables
in the past of treatment may introduce selection bias (sometimes referred to as M-bias). Therefore, the causally
relevant “history” at time $k$ needs to be understood as the set of treatment and confounders that are needed to achieve
conditional exchangeability for treatment $A_k$. Usually, those confounders $L_k$ will be in the past of treatment $A_k$ so, for
convenience, we will keep using the term “history” throughout the book.

In practice, however, the components of the g-formula cannot be directly
computed if the data are high-dimensional, as is expected in observational studies
with multiple confounders or time points. The quantities $E[Y \mid \bar{A} = \bar{a}, \bar{L} = \bar{l}]$
and $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$ will need to be estimated. For example, we can fit a linear
regression model to estimate the conditional means $E[Y \mid \bar{A} = \bar{a}, \bar{L} = \bar{l}]$ of
the outcome variable at the end of follow-up, and logistic regression models
to estimate the distribution of the discrete confounders $L_k$ at each time
$k \neq 0$ (the distribution of $L_0$ can be estimated without models as described
in Section 13.3). The estimates from these models, $\hat{E}[Y \mid \bar{A} = \bar{a}, \bar{L} = \bar{l}]$ and
$\hat{f}(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$, will then be plugged into the g-formula. Since Chapter
13, we have referred to this estimator as the plug-in g-formula and, when the
estimates used in the plug-in g-formula are based on parametric models, we
have referred to the plug-in g-formula as the parametric g-formula.

For simplicity, this chapter presents a version of the g-formula under deter- 
ministic static strategies only. However, the g-formula can be used to compute 
the mean of the outcome under any treatment strategy: deterministic or ran- 
dom, static or dynamic. Let us define ft” (ar|āk-1, lx) as the conditional prob- 
ability of treatment a, at time k if the treatment strategy (or intervention) of 
interest had been implemented in the population. Then, the general expression 
of the g-formula is 


K 


K 
SOE [Y|A=4,L=]] Il f (Ucl@x—1, le—1) II f'"* (axldx—1, te) « 
k=0 


l k=0 


Under a deterministic treatment strategy, f'”* (a;,|Gi—1, Ui) is always 1 and 
therefore does not need to be specified. For example, under the strategy 
“always treat” or @ = (1,1,...1), the probability f’* (1|@,-1,l,) = 1 at all 
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Technical Point 21.1 


The g-formula density for static strategies. The g-formula density for (Y, L) evaluated at (y,/) for a strategy a is 


K 
f (yla,0) [] f Glan 1) 
k=0 


The g-formula density for Y is simply the marginal density of Y under the g-formula density for (Y, L): 


K 


fs (yla, 1) II dF (1x |Gr—15 U1) 5 


k=0 


where the integral notation f is used—instead of the sum notation $ —to accommodate settings in which Ly represents 
a vector of variables and some of the variables in the vector are continuous. 

Given observed data $O = (\bar A, V, Y)$ where $V$ is the set of all measured variables other than treatment $\bar A$ and
outcome $Y$, the inputs of the g-formula are (i) a treatment strategy $\bar a$, (ii) a causal DAG representing the observed data
(and their unmeasured common causes), (iii) a subset $L$ of $V$ for which we wish to adjust, and (iv) a choice of a total
ordering of $L$, $\bar A$, and $Y$ consistent with the topology of the DAG, i.e., an ordering such that each variable comes after
its ancestors. The vector $L_k$ consists of all variables in $L$ after $A_{k-1}$ and before $A_k$ in the ordering. The chosen ordering
will usually, but not always, be temporal. See Fine Point 21.1 and Pearl and Robins (1995) for additional subtleties
that are beyond the scope of this book. When sequential exchangeability for $Y^{\bar a}$ and positivity hold for the chosen
ordering, the g-formula density for $Y$ equals the density of $Y$ that would have been observed in the study population if all
individuals had followed strategy $\bar a$. Otherwise, the g-formula can still be computed, but it lacks a causal interpretation.

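Note that the g-formula density for $(Y, \bar L)$ under treatment strategy $\bar a$ differs from the joint distribution

$$f(y \mid \bar A_K = \bar a_K, \bar L_K = \bar l_K) \prod_{k=0}^{K} f(l_k \mid \bar A_{k-1} = \bar a_{k-1}, \bar L_{k-1} = \bar l_{k-1}) \prod_{k=0}^{K} f(a_k \mid \bar A_{k-1} = \bar a_{k-1}, \bar L_k = \bar l_k)$$

only in that each factor $f(a_k \mid \bar A_{k-1} = \bar a_{k-1}, \bar L_k = \bar l_k)$ is eliminated. Note that each of the remaining factors is
evaluated at $\bar A_k = \bar a_k$, consistent with strategy $\bar a$.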


$k$. However, under other types of treatment strategies, $f^{int}(a_k \mid \bar a_{k-1}, \bar l_k)$ is
not 1 at all $k$ and therefore needs to be included in the g-formula. For
example, under the random static strategy “independently at each time $k$,
treat individuals with probability 0.3 and do not treat with probability 0.7”,
$f^{int}(1 \mid \bar a_{k-1}, \bar l_k) = 0.3$. Our publicly available software implements this general
form of the g-formula and therefore can accommodate any treatment strategy.

CODE: The gfoRmula R package (Lin et al. 2019) is available through CRAN.
The GFORMULA SAS macro is available through GitHub. See the book’s web site.
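The general expression can be sketched numerically for the two time points of Table 21.1. This is a minimal illustration, not the gfoRmula package: the cell counts below are an assumption meant to mirror Table 21.1 (which is not reproduced in this section), the stratum means are those listed in Table 21.2, and the function names and the dynamic strategy `treat_if_l1` are hypothetical.

```python
from itertools import product

# (A0, L1, A1) -> (number of individuals, mean of Y).
# ASSUMPTION: counts chosen to mirror Table 21.1, which is not shown here.
CELLS = {
    (0, 0, 0): (2400, 84.0), (0, 0, 1): (1600, 84.0),
    (0, 1, 0): (2400, 52.0), (0, 1, 1): (9600, 52.0),
    (1, 0, 0): (4800, 76.0), (1, 0, 1): (3200, 76.0),
    (1, 1, 0): (1600, 44.0), (1, 1, 1): (6400, 44.0),
}


def prob_l1_given_a0(l1, a0):
    """Observed f(l1 | a0); there is no L0 in this study."""
    in_arm = sum(n for (a, _, _), (n, _) in CELLS.items() if a == a0)
    in_stratum = sum(n for (a, l, _), (n, _) in CELLS.items() if (a, l) == (a0, l1))
    return in_stratum / in_arm


def g_formula_mean(f_int):
    """General g-formula: sum over (a0, l1, a1) of
    E[Y | a0, l1, a1] * f(l1 | a0) * f_int(a0) * f_int(a1 | a0, l1)."""
    total = 0.0
    for a0, l1, a1 in product((0, 1), repeat=3):
        mean_y = CELLS[(a0, l1, a1)][1]
        p_treatment = f_int(0, a0, None, None) * f_int(1, a1, a0, l1)
        total += mean_y * prob_l1_given_a0(l1, a0) * p_treatment
    return total


def always_treat(k, a_k, a_prev, l_k):
    """Deterministic static strategy: probability 1 for a_k = 1."""
    return 1.0 if a_k == 1 else 0.0


def random_30_percent(k, a_k, a_prev, l_k):
    """Random static strategy: treat with probability 0.3 at every time."""
    return 0.3 if a_k == 1 else 0.7


def treat_if_l1(k, a_k, a_prev, l_k):
    """Hypothetical dynamic strategy: no treatment at time 0,
    then treat at time 1 only if L1 = 1."""
    assigned = 0 if k == 0 else l_k
    return 1.0 if a_k == assigned else 0.0


for strategy in (always_treat, random_30_percent, treat_if_l1):
    print(strategy.__name__, round(g_formula_mean(strategy), 6))
```

Under the assumed counts, all three strategies give the same mean, which agrees with the finding reported later in this chapter that treatment has no effect in this example.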


21.2 IP weighting for time-varying treatments 


Suppose we are only interested in the effect of the time-fixed treatment $A_1$
in Table 21.1. We then want to contrast the counterfactual mean outcomes
$E[Y^{a_1=1}]$ and $E[Y^{a_1=0}]$. As we have seen in Chapter 12, under the
identifiability conditions, each of the counterfactual means $E[Y^{a_1}]$ is the mean
$E_{ps}[Y \mid A_1 = a_1]$ in the pseudo-population created by the subject-specific
nonstabilized weights $W^{A_1} = 1/f(A_1 \mid L_1)$ or the stabilized weights
$SW^{A_1} = f(A_1)/f(A_1 \mid L_1)$. The denominator of the IP weights is, informally, an
individual’s probability of receiving the treatment value that he or she received,
conditional on the individual’s confounder values. One can estimate $E_{ps}[Y \mid A_1 = a_1]$
from the observed study data by the average of $Y$ among subjects with $A_1 = a_1$
in the pseudo-population.

The same estimate of 0 is obtained when using stabilized IP weights $SW^{\bar A}$
in Figure 21.3 (check for yourself). However, $\Pr_{ps}[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$ is
1/2 in the nonstabilized pseudo-population and $\Pr[A_k = 1 \mid \bar A_{k-1}]$
in the stabilized pseudo-population.


When treatment and confounders are time-varying, these IP weights for
time-fixed treatments need to be generalized. For a time-varying treatment
$\bar A = (A_0, A_1)$ and time-varying covariates $\bar L = (L_0, L_1)$ at two time points, the
nonstabilized IP weights are

$$W^{\bar A} = \frac{1}{f(A_0 \mid L_0)} \times \frac{1}{f(A_1 \mid A_0, L_0, L_1)} = \prod_{k=0}^{1} \frac{1}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$$

and the stabilized IP weights are

$$SW^{\bar A} = \frac{f(A_0)}{f(A_0 \mid L_0)} \times \frac{f(A_1 \mid A_0)}{f(A_1 \mid A_0, L_0, L_1)} = \prod_{k=0}^{1} \frac{f(A_k \mid \bar A_{k-1})}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$$

where $A_{-1}$ is 0 by definition. The denominator of the IP weights for a
time-varying treatment is, informally, an individual’s probability of receiving the
treatment history that he or she received, conditional on the individual’s
covariate history.


Suppose we want to contrast the counterfactual mean outcomes $E[Y^{a_0=1,a_1=1}]$
and $E[Y^{a_0=0,a_1=0}]$. Under the identifiability assumptions for static strategies,
each counterfactual mean $E[Y^{a_0,a_1}]$ is the mean $E_{ps}[Y \mid A_0 = a_0, A_1 = a_1]$ in
the pseudo-population created by the nonstabilized weights $W^{\bar A}$ or the
stabilized weights $SW^{\bar A}$. The IP weighted estimator of each counterfactual mean
is the average of $Y$ among individuals with $\bar A = (a_0, a_1)$ in the
pseudo-population.


Let us apply IP weighting to the data from Table 21.1. The causally-
structured tree in Figure 21.3 is the tree graph in Figure 21.1 with additional
columns for the nonstabilized IP weights $W^{\bar A}$ and the number of individuals
in the corresponding pseudo-population $N_{ps}$ for each treatment and covariate
history. The pseudo-population has a size of 128,000, that is, the 32,000
individuals in the original population multiplied by 4, the number of static
strategies. Because there is no $L_0$ in this study, the denominator of the IP
weights simplifies to $f(A_0) f(A_1 \mid A_0, L_1)$.


The IP weighted estimator for the counterfactual mean $E[Y^{a_0=0,a_1=0}]$ is
the mean $E_{ps}[Y \mid A_0 = 0, A_1 = 0]$ in the pseudo-population, which we estimate
as the average outcome among the 32,000 individuals with $(A_0 = 0, A_1 = 0)$ in
the pseudo-population. From the tree in Figure 21.3, the estimate is
$84 \times \frac{8000}{32000} + 52 \times \frac{24000}{32000} = 60$. Similarly, the IP weighted estimate of
$E[Y^{a_0=1,a_1=1}]$ is also 60. Therefore the estimate of the causal effect
$E[Y^{a_0=1,a_1=1}] - E[Y^{a_0=0,a_1=0}]$ is 0, as expected. IP weighting, like the
g-formula, succeeds where traditional methods failed.
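The calculation above can be reproduced with a short sketch. The cell counts are an assumption intended to mirror Table 21.1 (not shown in this section), and the helper names are hypothetical; the weights follow the simplified denominator $f(A_0) f(A_1 \mid A_0, L_1)$ described in the text.

```python
# (A0, L1, A1) -> (number of individuals, mean of Y).
# ASSUMPTION: counts chosen to mirror Table 21.1, which is not shown here.
CELLS = {
    (0, 0, 0): (2400, 84.0), (0, 0, 1): (1600, 84.0),
    (0, 1, 0): (2400, 52.0), (0, 1, 1): (9600, 52.0),
    (1, 0, 0): (4800, 76.0), (1, 0, 1): (3200, 76.0),
    (1, 1, 0): (1600, 44.0), (1, 1, 1): (6400, 44.0),
}
N_TOTAL = sum(n for n, _ in CELLS.values())


def count(**fixed):
    """Number of individuals whose (a0, l1, a1) match the given values."""
    names = ("a0", "l1", "a1")
    return sum(
        n for cell, (n, _) in CELLS.items()
        if all(cell[names.index(key)] == value for key, value in fixed.items())
    )


def ip_weight(a0, l1, a1, stabilized=False):
    """W = 1 / [f(a0) f(a1 | a0, l1)]; SW multiplies by f(a0) f(a1 | a0)."""
    f_a0 = count(a0=a0) / N_TOTAL
    f_a1_given_history = count(a0=a0, l1=l1, a1=a1) / count(a0=a0, l1=l1)
    weight = 1.0 / (f_a0 * f_a1_given_history)
    if stabilized:
        weight *= f_a0 * count(a0=a0, a1=a1) / count(a0=a0)
    return weight


def ipw_mean(a0, a1, stabilized=False):
    """Average of Y among pseudo-population members with (A0, A1) = (a0, a1)."""
    numerator = denominator = 0.0
    for l1 in (0, 1):
        n, mean_y = CELLS[(a0, l1, a1)]
        n_ps = n * ip_weight(a0, l1, a1, stabilized)
        numerator += n_ps * mean_y
        denominator += n_ps
    return numerator / denominator


pseudo_population_size = sum(
    n * ip_weight(*cell) for cell, (n, _) in CELLS.items()
)
effect = ipw_mean(1, 1) - ipw_mean(0, 0)
effect_stabilized = ipw_mean(1, 1, stabilized=True) - ipw_mean(0, 0, stabilized=True)
print(round(pseudo_population_size), round(effect, 6), round(effect_stabilized, 6))
```

Passing `stabilized=True` is one way to carry out the "check for yourself" exercise for the weights $SW^{\bar A}$.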





Note that our nonparametric estimates of $E[Y^{a_0,a_1}]$ based on the g-formula
are precisely equal to those based on IP weighting. This equality has nothing 
to do with causal inference. That is, even if the identifiability conditions did 
not hold—so neither the g-formula nor IP weighting estimates have a causal 
interpretation—both approaches would yield the same mean in the population. 
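A short derivation may help show why the two nonparametric estimates coincide. It is written for the two time points of Table 21.1 (no $L_0$), assumes positivity so that the weights are well defined, and uses only the laws of probability, not exchangeability.

```latex
\begin{align*}
E\!\left[ W^{\bar A}\, I(A_0 = a_0, A_1 = a_1)\, Y \right]
  &= \sum_{l_1} E[Y \mid a_0, l_1, a_1]\,
     \frac{f(a_0)\, f(l_1 \mid a_0)\, f(a_1 \mid a_0, l_1)}
          {f(a_0)\, f(a_1 \mid a_0, l_1)} \\
  &= \sum_{l_1} E[Y \mid a_0, l_1, a_1]\, f(l_1 \mid a_0), \\
E\!\left[ W^{\bar A}\, I(A_0 = a_0, A_1 = a_1) \right]
  &= \sum_{l_1} f(l_1 \mid a_0) = 1 .
\end{align*}
```

The ratio of the two expectations is the weighted mean $E_{ps}[Y \mid A_0 = a_0, A_1 = a_1]$, and it equals the g-formula $\sum_{l_1} E[Y \mid a_0, l_1, a_1] f(l_1 \mid a_0)$ term by term.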
Figure 21.3


In practice, the most common approach is to fit a single model for
$\Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$ rather than a separate model at each time $k$.
The model includes a function of time $k$ (often referred to as a
time-varying intercept) as a covariate.
Let us generalize IP weighting to high-dimensional settings with multiple
times $k = 0, 1, \ldots, K$. The general form of the nonstabilized IP weights is

$$W^{\bar A} = \prod_{k=0}^{K} \frac{1}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$$

and the general form of the stabilized IP weights is

$$SW^{\bar A} = \prod_{k=0}^{K} \frac{f(A_k \mid \bar A_{k-1})}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$$


When the identifiability conditions hold, these IP weights create a pseudo-
population in which (i) the mean of $Y^{\bar a}$ is identical to that in the actual
population, but (ii) as in Figure 19.1, the randomization probabilities at each time
$k$ are constant (nonstabilized weights) or depend at most on past treatment
history (stabilized weights). Hence the average causal effect $E[Y^{\bar a}] - E[Y^{\bar a'}]$
is $E_{ps}[Y \mid \bar A = \bar a] - E_{ps}[Y \mid \bar A = \bar a']$ because unconditional sequential
exchangeability holds in both pseudo-populations.

In a true sequentially randomized trial, the quantities $f(A_k \mid \bar A_{k-1}, \bar L_k)$ are
known by design. Therefore we can use them to compute nonstabilized IP
weights and the estimates of $E[Y^{\bar a}]$ and $E[Y^{\bar a}] - E[Y^{\bar a'}]$ are guaranteed to be
unbiased. In contrast, in observational studies, the quantities $f(A_k \mid \bar A_{k-1}, \bar L_k)$
will need to be estimated from the data. When the data are high-dimensional,
we can, for example, fit a logistic regression model to estimate the conditional
probability of a dichotomous treatment $\Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$ at each
time $k$. The estimates $\hat f(A_k \mid \bar A_{k-1}, \bar L_k)$ from these models will then replace
$f(A_k \mid \bar A_{k-1}, \bar L_k)$ in $W^{\bar A}$. If the estimates $\hat f(A_k \mid \bar A_{k-1}, \bar L_k)$ are based on a
misspecified logistic model for $\Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$, the resulting estimates
of $E[Y^{\bar a}]$ and $E[Y^{\bar a}] - E[Y^{\bar a'}]$ will be biased. For stabilized weights $SW^{\bar A}$
we must also obtain an estimate of $f(A_k \mid \bar A_{k-1})$ for the numerator. Even
if this estimate is based on a misspecified model, the estimates of $E[Y^{\bar a}]$ and
$E[Y^{\bar a}] - E[Y^{\bar a'}]$ remain unbiased, although the distribution of treatment in the
stabilized pseudo-population will differ from that in the observed population.
There is no logical guarantee of no
model misspecification even when 
the estimates from both paramet- 
ric approaches are similar, as they 
may both be biased in the same di- 
rection. 


This marginal structural model is 
unsaturated. Remember, saturated 
models have an equal number of 
unknowns on both sides of the 
equation. 
Suppose that we obtain two estimates of $E[Y^{\bar a}]$, one using the parametric
g-formula and another one using IP weights estimated via parametric models,
and that the two estimates differ by more than can be reasonably explained by
sampling variability (the sampling variability of the difference of the estimates
can be quantified by bootstrapping). We can then conclude that the parametric
models used for the g-formula or the parametric models used for IP weighting
(or both) are misspecified. This conclusion is always true, regardless of whether
the identifiability assumptions hold. An implication is that one should always
estimate $E[Y^{\bar a}]$ using both methods and, if the estimates differ substantially
(according to some prespecified criterion), reexamine all the models and modify
them where necessary. In the next section, we describe how doubly-robust
estimators can help deal with model misspecification.

Also, as we discussed in the previous section, it is not infrequent that the
number of unknown quantities $E[Y^{\bar a}]$ far exceeds the sample size. Thus we
need to specify a model that combines information from many strategies to
help estimate a given $E[Y^{\bar a}]$. For example, we can hypothesize that the effect
of treatment history $\bar a$ on the mean outcome increases linearly as a function of
the cumulative treatment $cum(\bar a) = \sum_{k=0}^{K} a_k$ under strategy $\bar a$. This hypothesis
is encoded in the marginal structural mean model

$$E[Y^{\bar a}] = \beta_0 + \beta_1 cum(\bar a)$$

for all $\bar a$, which is a more general version of the marginal structural mean model
for time-fixed treatments discussed in Chapter 12. There are $2^K$ different
unknown quantities on the left hand side of the model, one for each of the $2^K$
different strategies $\bar a$, but only 2 unknown parameters $\beta_0$ and $\beta_1$ on the right
hand side. The parameter $\beta_1$ measures the average causal effect of the
time-varying treatment $\bar A$. The average causal effect $E[Y^{\bar a}] - E[Y^{\bar a = \bar 0}]$ is equal to
$\beta_1 \times cum(\bar a)$.

As discussed in Chapter 12, to estimate the parameters of the marginal
structural model, we fit the ordinary linear regression model

$$E[Y \mid \bar A] = \theta_0 + \theta_1 cum(\bar A)$$

in the pseudo-population, that is, we use weighted least squares with weights
being estimates of either $SW^{\bar A}$ or $W^{\bar A}$. Under the identifiability conditions, the
estimate of the associational parameter $\theta_1$ is consistent for the causal
parameter $\beta_1$. As described in Chapter 12, the variance of $\hat\beta_1$, and thus of the contrast
$E[Y^{\bar a}] - E[Y^{\bar a = \bar 0}]$, can be estimated by the nonparametric bootstrap or by
computing its analytic variance (which requires additional statistical analysis
and programming). We can also construct a conservative 95% confidence
interval by using the robust variance estimator of $\hat\beta_1$, which most statistical
software packages output directly. For a non-saturated marginal structural
model the intervals will typically be narrower when the model is
fit with the weights $SW^{\bar A}$ than with the weights $W^{\bar A}$, so the $SW^{\bar A}$ weights are
preferred.
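As an illustration, the weighted least squares fit can be sketched on the two time points of Table 21.1. The counts are an assumption meant to mirror that table (not shown in this section), the function names are hypothetical, and the closed-form expressions for a one-covariate regression stand in for a regression routine.

```python
# (A0, L1, A1) -> (number of individuals, mean of Y).
# ASSUMPTION: counts chosen to mirror Table 21.1, which is not shown here.
CELLS = {
    (0, 0, 0): (2400, 84.0), (0, 0, 1): (1600, 84.0),
    (0, 1, 0): (2400, 52.0), (0, 1, 1): (9600, 52.0),
    (1, 0, 0): (4800, 76.0), (1, 0, 1): (3200, 76.0),
    (1, 1, 0): (1600, 44.0), (1, 1, 1): (6400, 44.0),
}


def stabilized_weight(a0, l1, a1):
    """SW = f(a1 | a0) / f(a1 | a0, l1); f(a0) cancels because there is no L0."""
    n_a0 = sum(n for (a, _, _), (n, _) in CELLS.items() if a == a0)
    n_a0_a1 = sum(n for (a, _, b), (n, _) in CELLS.items() if (a, b) == (a0, a1))
    n_a0_l1 = sum(n for (a, l, _), (n, _) in CELLS.items() if (a, l) == (a0, l1))
    numerator = n_a0_a1 / n_a0
    denominator = CELLS[(a0, l1, a1)][0] / n_a0_l1
    return numerator / denominator


def unit_weight(a0, l1, a1):
    """No weighting: reproduces the ordinary (associational) regression."""
    return 1.0


def fit_line(weight_fn):
    """Weighted least squares for E[Y | A] = theta0 + theta1 * cum(A)."""
    rows = [
        (n * weight_fn(*cell), cell[0] + cell[2], mean_y)
        for cell, (n, mean_y) in CELLS.items()
    ]
    total_w = sum(w for w, _, _ in rows)
    x_bar = sum(w * x for w, x, _ in rows) / total_w
    y_bar = sum(w * y for w, _, y in rows) / total_w
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in rows)
    s_xx = sum(w * (x - x_bar) ** 2 for w, x, _ in rows)
    theta1 = s_xy / s_xx
    return y_bar - theta1 * x_bar, theta1


theta0_ipw, theta1_ipw = fit_line(stabilized_weight)
theta0_raw, theta1_raw = fit_line(unit_weight)
print(round(theta0_ipw, 6), round(theta1_ipw, 6), round(theta1_raw, 3))
```

With the assumed counts, the weighted slope is zero while the unweighted slope is negative, which illustrates why the regression has to be fit in the pseudo-population rather than in the original data.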

Of course, the estimates of $E[Y^{\bar a}]$ will be incorrect if the marginal
structural mean model is misspecified, that is, if the mean counterfactual outcome
depends on the treatment strategy through a function of the time-varying
treatment other than cumulative treatment $cum(\bar a)$ (say, cumulative treatment only
in the final 5 months $\sum_{k=K-5}^{K} a_k$) or depends nonlinearly (say, quadratically) on
Technical Point 21.2


The g-null paradox. When using the parametric g-formula, model misspecification will result in biased estimates of
$E[Y^{\bar a}]$, even if the identifiability conditions hold. Suppose there is treatment-confounder feedback and the sharp null
hypothesis of no effect of treatment on $Y$ is true, that is,

$$Y^{\bar a} - Y^{\bar a'} = 0 \text{ with probability 1 for all } \bar a' \text{ and } \bar a.$$

Then the value of the g-formula for $E[Y^{\bar a}]$ is the same for any strategy $\bar a$, even though $E[Y \mid \bar A = \bar a, \bar L = \bar l]$
and $f(l_k \mid \bar a_{k-1}, \bar l_{k-1})$ both depend on $\bar a$. Now suppose we use standard non-saturated parametric models
$E[Y \mid \bar A = \bar a, \bar L = \bar l; \theta]$ and $f(l_k \mid \bar a_{k-1}, \bar l_{k-1}; \varphi)$ based on distinct (i.e., variation-independent) parameters $\theta$ and $\varphi$ to
estimate the components of the g-formula. Then Robins and Wasserman (1997) showed that, when $L_k$ has any discrete
components, these models cannot all be correctly specified because the estimated value of the g-formula for $E[Y^{\bar a}]$
will generally depend on $\bar a$. As a consequence, inference based on the estimated g-formula might theoretically result
in the sharp null hypothesis being falsely rejected, even in a sequentially randomized experiment. This phenomenon is
referred to as the null paradox of the estimated g-formula for time-varying treatments. See Cox and Wermuth (1999) for
additional discussion. Fortunately, the g-null paradox has not prevented the parametric g-formula from estimating null effects
in practice, presumably because the bias induced by the paradox is small compared with typical random variability.

In contrast, and as described in Chapters 12 and 14, neither IP weighting of marginal structural mean models nor
g-estimation of structural nested mean models suffers from the null paradox in a sequentially randomized experiment
where the treatment probabilities are known by design. These methods preserve the null because the models are correctly
specified no matter what functional form we choose for treatment. For example, the marginal structural mean model
$E[Y^{\bar a}] = \beta_0 + \beta_1 cum(\bar a)$ is correctly specified under the null because, in that case, $\beta_1 = 0$ and $E[Y^{\bar a}]$ would not depend
on the function of $\bar a$. Also, any structural nested mean model $\gamma_k(\bar a_{k-1}, \bar l_k, \beta)$ is correctly specified under the null
with $\beta = 0$ being the true parameter value and $\gamma_k(\bar a_{k-1}, \bar l_k, \beta) = 0$, regardless of the function of past treatment and
covariate history.





cumulative treatment. However, if we fit the model

$$E[Y \mid \bar A] = \theta_0 + \theta_1 cum(\bar A) + \theta_2 cum_{-5}(\bar A) + \theta_3 cum(\bar A)^2$$

with weights $SW^{\bar A}$ or $W^{\bar A}$, a Wald test on two degrees of freedom of the joint
hypothesis $\theta_2 = \theta_3 = 0$ is a test of the null hypothesis that our marginal
structural model is correctly specified. That is, IP weighting of marginal structural
models is not subject to the g-null paradox described in Technical Point 21.2.
In practice, one might choose to use a marginal structural model that includes
different summaries of treatment history $\bar A$ as covariates, and that uses flexible
functions like, say, cubic splines.

This test will generally have good statistical power against the particular
directions of misspecification mentioned above, especially if the
weights $SW^{\bar A}$ are used.

Finally, as we discussed in Section 12.5, we can use a marginal structural
model to explore effect modification by a subset $V$ of the covariates in $L_0$.
For example, for a dichotomous baseline variable $V$, we would elaborate our
marginal structural mean model as

$$E[Y^{\bar a} \mid V] = \beta_0 + \beta_1 cum(\bar a) + \beta_2 V + \beta_3 cum(\bar a) V$$

The parameters of this model can be estimated by fitting the ordinary linear
regression model $E[Y \mid \bar A, V] = \theta_0 + \theta_1 cum(\bar A) + \theta_2 V + \theta_3 V cum(\bar A)$ by weighted
least squares with IP weights $W^{\bar A}$ or, better,
$SW^{\bar A}(V) = \prod_{k=0}^{K} \frac{f(A_k \mid \bar A_{k-1}, V)}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$.
In the presence of treatment-confounder feedback, $V$ can only include baseline
variables. If $V$ had components of $L_k$ for $k > 0$ then the parameters $\beta_1$ and $\beta_3$
could be different from 0 even if treatment had no effect on the mean outcome
at any time.

We now describe a doubly robust estimator of marginal structural mean 
models. 


21.3 A doubly robust estimator for time-varying treatments 


Doubly robust estimators give us 
two chances to get it right when, 
as in most observational studies, 
there are many confounders and 
non-saturated models are required. 


The use of the “clever covariate” 
D to achieve double robustness 
was first proposed by Bang and 
Robins (2005) for both time-fixed 
and time-varying treatments. 


Part II briefly mentioned doubly robust methods that combine IP weighting 
and the g-formula. As we know, IP weighting requires a correct model for treat- 
ment A conditional on the confounders L, and the g-formula requires a correct 
model for the outcome Y conditional on treatment A and the confounders L. 
Doubly robust methods require a correct model for either treatment A or out- 
come Y. If at least one of the two models is correct (and one need not know 
which of the two models is correct), a doubly robust estimator consistently 
estimates the causal effect. Technical Point 13.2 described a doubly robust es- 
timator for the average causal effect of a time-fixed treatment A on an outcome 
Y. In this section, we first review this doubly robust estimator for time-fixed 
treatments and then extend it to time-varying treatments. 

Suppose we are only interested in the effect of a time-fixed treatment $A$,
that is, the difference of counterfactual means $E[Y^{a=1}] - E[Y^{a=0}]$, under
exchangeability, positivity, and consistency in a setting with many confounders
$L$. One possibility is to fit an outcome model for $E[Y \mid A = a, L = l]$ and then
standardize (parametric g-formula); another possibility is to fit a treatment
model for $\Pr[A = 1 \mid L]$ and then use it to compute weights $W^A = 1/f(A \mid L)$
(IP weighting). A doubly robust method estimates both models and combines
them. The doubly robust procedure has three steps.

The first step is to use the predicted values $\widehat{\Pr}[A = 1 \mid L]$ from the
treatment model to compute the IP weight estimates $\hat W^A$. The second step is to
compute the predicted values $\hat E[Y \mid A = a, L = l, R]$ from a modified outcome
model that includes the covariate $R$, where $R = \hat W^A$ if $A = 1$ and $R = -\hat W^A$
if $A = 0$. The third step is to standardize the mean of the predicted value
$\hat E[Y \mid A = a, L = l, R]$ under $A = 1$ and under $A = 0$. The difference of the
standardized mean outcomes is a doubly robust estimator of the causal effect
$E[Y^{a=1}] - E[Y^{a=0}]$. That is, under the identifiability conditions, this
estimator validly estimates the average causal effect if either the model for the
treatment or for the outcome is correct.
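The three steps can be sketched on a small aggregated data set. Everything here is illustrative: the data in `TOY` are hypothetical, the treatment model is saturated (stratum proportions), the outcome model deliberately omits the confounder $L$ so that it is misspecified, and a hand-written least squares solver stands in for a regression routine.

```python
# HYPOTHETICAL aggregated data: (L, A) -> (number of individuals, mean of Y).
TOY = {(0, 0): (300, 10.0), (0, 1): (100, 12.0),
       (1, 0): (100, 20.0), (1, 1): (500, 24.0)}
N_TOTAL = sum(n for n, _ in TOY.values())


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, size))
        solution[r] = (aug[r][size] - tail) / aug[r][r]
    return solution


def least_squares(design, outcomes, counts):
    """Least squares with frequency weights via the normal equations."""
    p = len(design[0])
    xtx = [[sum(n * x[i] * x[j] for x, n in zip(design, counts))
            for j in range(p)] for i in range(p)]
    xty = [sum(n * x[i] * y for x, y, n in zip(design, outcomes, counts))
           for i in range(p)]
    return solve_linear_system(xtx, xty)


def prob_treated(l):
    """Step 1: saturated treatment model Pr[A = 1 | L = l]."""
    return TOY[(l, 1)][0] / (TOY[(l, 0)][0] + TOY[(l, 1)][0])


def covariate_r(a, l):
    """R = W if A = 1 and R = -W if A = 0, with W = 1 / f(A | L)."""
    p = prob_treated(l)
    return 1.0 / p if a == 1 else -1.0 / (1.0 - p)


# Step 2: outcome model E[Y | A, R] = t0 + t1 * A + t2 * R (L omitted on purpose).
cells = list(TOY)
theta = least_squares(
    [[1.0, float(a), covariate_r(a, l)] for (l, a) in cells],
    [TOY[cell][1] for cell in cells],
    [TOY[cell][0] for cell in cells],
)


def standardized_mean(a):
    """Step 3: average the prediction with A set to a over the distribution of L."""
    total = 0.0
    for l in (0, 1):
        n_l = TOY[(l, 0)][0] + TOY[(l, 1)][0]
        total += n_l * (theta[0] + theta[1] * a + theta[2] * covariate_r(a, l))
    return total / N_TOTAL


dr_effect = standardized_mean(1) - standardized_mean(0)
crude_effect = 22.0 - 12.5  # difference of the unadjusted arm means of TOY
print(round(dr_effect, 6), crude_effect)
```

In this toy example the stratum-specific effects are 2 and 4, so the standardized effect is 3.2; the sketch recovers that value even though the outcome model ignores $L$, because the treatment model is correct.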

Let us now extend this doubly robust estimator to settings with time-
varying treatments in which we are interested in comparing the counterfactual
means $E[Y^{\bar a}]$ and $E[Y^{\bar a'}]$ under two treatment strategies $\bar a$ and $\bar a'$. The doubly
robust procedure to estimate $E[Y^{\bar a}]$ for a time-varying treatment follows the
same 3 steps as the procedure to estimate $E[Y^a]$ for a time-fixed treatment.
However, as we will see, the second step is a bit more involved because it
requires the fitting of sequential regression models. Next we describe how
to obtain a doubly robust estimator of $E[Y^{\bar a}]$ under the treatment strategy
“always treated”.

The first step requires fitting a regression model for $\Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$
and then using the predicted values from this model to estimate the time-varying
IP weights $W^{\bar A_m} = \prod_{k=0}^{m} \frac{1}{f(A_k \mid \bar A_{k-1}, \bar L_k)}$ at each time $m$, where
$f(A_k \mid \bar A_{k-1}, \bar L_k) = \Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k]$ for person-times with $A_k = 1$ and
$f(A_k \mid \bar A_{k-1}, \bar L_k) = \Pr[A_k = 0 \mid \bar A_{k-1}, \bar L_k]$ for person-times with $A_k = 0$. That is, for each
individual, we estimate a different weight for each time point rather than a single
weight at the end of follow-up as in the previous section. For example, if we
fit the parametric model $\Pr[A_k = 1 \mid \bar A_{k-1}, \bar L_k] = \alpha_{0,k} + \alpha_1 A_{k-1} + \alpha_2 L_k$, then,
in our example of Table 21.1 with two time points, the predicted values are
$\widehat{\Pr}[A_1 = 1 \mid A_0, L_1] = \hat\alpha_{0,1} + \hat\alpha_1 A_0 + \hat\alpha_2 L_1$ and $\widehat{\Pr}[A_0 = 1 \mid L_0] = \hat\alpha_{0,0} + \hat\alpha_2 L_0$
(because $A_{-1} = 0$). We then compute the time-varying IP weight estimates
$\hat W^{\bar A_m} = \prod_{k=0}^{m} \frac{1}{\hat f(A_k \mid \bar A_{k-1}, \bar L_k)}$. In addition, we also compute the modified IP
weight $\hat W^{\bar A_{m-1}, a_m = 1} = \hat W^{\bar A_{m-1}} \times \frac{1}{\hat f(a_m = 1 \mid \bar A_{m-1}, \bar L_m)}$ in which the treatment
value at time $m$ is set to the corresponding treatment value under the strategy
“always treated”. We have reached the end of Step 1.
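The bookkeeping of Step 1 can be sketched for a single individual. The predicted probabilities below are hypothetical inputs (in practice they would come from the fitted treatment model), and the function name is an assumption made for illustration.

```python
def time_varying_weights(treatments, prob_treated, strategy):
    """Return the usual and the modified IP weight estimates at each time m.

    treatments[k]   : observed treatment A_k (0 or 1)
    prob_treated[k] : predicted Pr[A_k = 1 | past treatment and covariates]
    strategy[k]     : treatment value a_k under the strategy of interest
    """
    usual, modified = [], []
    weight_through_previous_time = 1.0  # empty product before time 0
    for a_observed, p, a_strategy in zip(treatments, prob_treated, strategy):
        f_observed = p if a_observed == 1 else 1.0 - p
        f_strategy = p if a_strategy == 1 else 1.0 - p
        # Usual weight: product of 1 / f(A_k | past) through time m.
        usual.append(weight_through_previous_time / f_observed)
        # Modified weight: same product through m - 1, but with the
        # treatment at time m set to the value dictated by the strategy.
        modified.append(weight_through_previous_time / f_strategy)
        weight_through_previous_time = usual[-1]
    return usual, modified


# HYPOTHETICAL individual: treated at time 0, untreated at time 1,
# evaluated under the strategy "always treated".
usual, modified = time_varying_weights(
    treatments=(1, 0), prob_treated=(0.5, 0.8), strategy=(1, 1)
)
print(usual, modified)
```

The two lists coincide at any time at which the individual's observed treatment matches the strategy and differ otherwise, which is what the second step exploits when it predicts under the strategy of interest.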

The second step requires fitting a separate outcome regression model at each
time $m$, starting from the last time $K$ and going down towards $m = 0$. This
sequence of regression models has two peculiarities. First, the time-varying IP
weight estimate $\hat W^{\bar A_m}$ is included as a covariate. Second, the outcome of the
model is $Y$ only at the last time $K$. At all other times $m$, the outcome of the
model is a variable $\hat T_{m+1}$ that is generated by the previous regression at time
$m + 1$.

Because doubly robust estimation for time-varying treatments relies
on a sequential outcome regression, we need to fit the regression models
at each time $m$ sequentially rather than simultaneously.

For example, suppose we decide to fit the regression model

$$E[\hat T_{m+1} \mid \bar A_m, \bar L_m] = \theta_{0,m} + \theta_1 cum(\bar A_m) + \theta_2 L_m + \theta_3 \hat W^{\bar A_m}$$

where treatment history $\bar A_m$ is summarized by cumulative treatment as in the
marginal structural mean model of the previous section, and covariate history
$\bar L_m$ is summarized by its most recent value $L_m$. To define the variable $\hat T_{m+1}$,
let us consider the simple case with 2 time points only, i.e., with $K = 1$.
(Technical Point 21.3 provides the general definition for multiple times.)


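Start by fitting the model $E[\hat T_2 \mid \bar A_1, \bar L_1] = E[Y \mid \bar A_1, \bar L_1] = \theta_{0,1} + \theta_1 cum(\bar A_1) + \theta_2 L_1 + \theta_3 \hat W^{\bar A_1}$
with $\hat T_2 = Y$. Use the parameter estimates $\hat\theta$ to calculate
the predicted value from this model with $A_1$ set to 1, as it should be
under the strategy “always treated”, which implies that $\hat W^{\bar A_1}$ needs to be
replaced by $\hat W^{A_0, a_1 = 1}$. The predicted value for each individual $i$ is therefore
$\hat T_{1,i} = \hat\theta_{0,1} + \hat\theta_1 \times 2 + \hat\theta_2 L_{1,i} + \hat\theta_3 \hat W_i^{A_0, a_1 = 1}$. This predicted value $\hat T_{1,i}$ is the new
outcome variable to be used in the next regression model. Now fit the model
$E[\hat T_1 \mid A_0, L_0] = \theta_{0,0} + \theta_1 A_0 + \theta_2 L_0 + \theta_3 \hat W^{A_0}$ and calculate again the predicted
value with $A_0$ set to 1, which is $\hat T_{0,i} = \hat\theta_{0,0} + \hat\theta_1 \times 1 + \hat\theta_2 L_{0,i} + \hat\theta_3 \hat W_i^{a_0 = 1}$ for
individual $i$. We have reached the end of Step 2 as there are no more time
points.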

The third step is to standardize the mean of $\hat T_0$, which we do by simply
computing its average across all individuals. This average $\hat E[\hat T_0]$ is a valid doubly
robust estimator of the counterfactual mean $E[Y^{a_0=1,a_1=1}]$. That is, under
conditional exchangeability and positivity given $\bar L_m$, this estimator validly
estimates the average causal effect if one of the three following statements holds:
(i) the treatment model is correct at all times, (ii) the outcome model is
correct at all times, or (iii) the treatment model is correct for time 0 to $k$ and
the outcome model is correct for times $k + 1$ to $K$, for any $k < K$. This last
statement is known as $k + 1$ robustness.

$k + 1$ robustness was described by Molina et al. (2017).

To estimate the counterfactual mean $E[Y^{a_0=0,a_1=0}]$ under the treatment
strategy “never treated”, repeat the above steps using $(a_0 = 0, a_1 = 0)$. The
difference of means of $\hat T_0$ computed under each strategy is a doubly robust
estimator of the average causal effect $E[Y^{a_0=1,a_1=1}] - E[Y^{a_0=0,a_1=0}]$.
Technical Point 21.3


A doubly robust estimator of $E[Y^{\bar a}]$ for time-varying treatments. Suppose we are interested in estimating the
counterfactual mean $E[Y^{\bar a}]$ under treatment strategy $\bar a = (a_0, a_1, \ldots, a_K)$ assuming that sequential exchangeability and
positivity hold at all times $m = 0, 1, \ldots, K$. Bang and Robins (2005) proposed a recursive method. For a dichotomous
treatment and continuous outcome, the method can be implemented as follows:


1. Fit a logistic model $s(\bar A_{m-1}, \bar L_m; \alpha)$ for $\Pr[A_m = 1 \mid \bar A_{m-1}, \bar L_m]$ with data pooled over all times $m = 0, 1, \ldots, K$
and all individuals. Obtain the MLE $\hat\alpha$ of the vector parameter $\alpha$. For each person-time, compute both the
usual time-varying IP weight estimate $\hat W^{\bar A_m} = \prod_{k=0}^{m} \frac{1}{f(A_k \mid \bar A_{k-1}, \bar L_k; \hat\alpha)}$ and the modified IP weight
$\hat W^{\bar A_{m-1}, a_m} = \frac{\hat W^{\bar A_{m-1}}}{f(a_m \mid \bar A_{m-1}, \bar L_m; \hat\alpha)}$ for the value $a_m$ in the strategy of interest, with $A_{-1} = 0$ by definition.


2. Set $\hat T_{K+1} = Y$. Recursively, for $m = K, K-1, \ldots, 0$,

(a) specify and fit a parametric linear regression model $h(\bar A_m, \bar L_m; \theta)$, with $\hat W^{\bar A_m}$ as a covariate, for the
conditional expectation $E[\hat T_{m+1} \mid \bar A_m, \bar L_m]$. Obtain the MLE $\hat\theta$ of the vector parameter $\theta$.

(b) set $\hat T_m$ as the predicted value $h(\bar A_{m-1}, a_m, \bar L_m; \hat\theta)$ computed using the covariate
$\hat W^{\bar A_{m-1}, a_m}$ rather than $\hat W^{\bar A_m}$.


3. Estimate $E[Y^{\bar a}]$ by $\hat E[\hat T_0]$.


If either the model $s(\bar A_{m-1}, \bar L_m; \alpha)$ or the model $h(\bar A_m, \bar L_m; \theta)$ is correctly specified, then $\hat E[\hat T_0]$ is consistent
for $E[Y^{\bar a}]$. Confidence intervals can be obtained using the nonparametric bootstrap. Note that, when $\hat W^{\bar A_m}$ is not
used as a covariate, this sequential regression procedure is an alternative, non-doubly robust procedure to estimate the
parametric g-formula.
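The closing remark can be illustrated with a minimal sketch of the sequential regression procedure without the weight covariate, that is, the non-doubly robust version. It is deliberately nonparametric: each "regression" is a saturated model, so its prediction is a stratum mean. The counts are an assumption meant to mirror Table 21.1 (not shown in this section), and the function name is hypothetical.

```python
from itertools import product

# (A0, L1, A1) -> (number of individuals, mean of Y).
# ASSUMPTION: counts chosen to mirror Table 21.1, which is not shown here.
CELLS = {
    (0, 0, 0): (2400, 84.0), (0, 0, 1): (1600, 84.0),
    (0, 1, 0): (2400, 52.0), (0, 1, 1): (9600, 52.0),
    (1, 0, 0): (4800, 76.0), (1, 0, 1): (3200, 76.0),
    (1, 1, 0): (1600, 44.0), (1, 1, 1): (6400, 44.0),
}


def sequential_regression_mean(a0_strategy, a1_strategy):
    """Iterated conditional expectation for K = 1, with no weight covariate."""
    # m = 1: regress T2 = Y on (A0, L1, A1) and predict with A1 set to the
    # strategy value. With a saturated model the prediction is a stratum mean.
    t1 = {
        (a0, l1): CELLS[(a0, l1, a1_strategy)][1]
        for a0, l1 in product((0, 1), repeat=2)
    }
    # m = 0: regress T1 on A0 (there is no L0) and predict with A0 set to the
    # strategy value, i.e. average T1 among individuals with that A0.
    numerator = denominator = 0.0
    for (a0, l1, _), (n, _) in CELLS.items():
        if a0 == a0_strategy:
            numerator += n * t1[(a0, l1)]
            denominator += n
    t0 = numerator / denominator
    # Final step: average T0 over all individuals. T0 is constant here
    # because there are no baseline covariates.
    return t0


for a0, a1 in product((0, 1), repeat=2):
    print((a0, a1), sequential_regression_mean(a0, a1))
```

Under the assumed counts every static strategy yields the same mean, matching the g-formula and IP weighting results reported earlier in the chapter.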


This doubly robust estimator for average causal effects is ready for use in
practice, though its widespread implementation has been historically hampered
by computational constraints and lack of user-friendly software, especially for
hazard-based survival analysis. We anticipate that, in the near future, doubly
(or multiply) robust estimators will become more common when studying the
effect of complex treatment strategies on failure time outcomes. See Fine Point
21.2 for a description of the different representations of the g-formula and their
connections to the above doubly robust estimator.

van der Laan and Gruber (2012) proposed an extension of this doubly robust
estimator that includes a data adaptive procedure. They called their method
longitudinal targeted minimum loss-based estimation (TMLE).


21.4 G-estimation for time-varying treatments 


Suppose we were only interested in the effect of the time-fixed treatment $A_1$ in Table
21.1. In Chapter 14 we described structural nested mean models for the
conditional causal effect of a time-fixed treatment within levels of the covariates.
Those models had a single equation because there was a single time point $k = 0$.
The extension to time-varying treatments requires that the model specifies as
many equations as time points in the data. For the time-varying treatment
$\bar A = (A_0, A_1)$ at two time points in Table 21.1, we can specify a (saturated)
Fine Point 21.2


Representations of the g-formula. The g-formula can be mathematically represented in several ways. These
different representations of the g-formula are nonparametrically equivalent but lead to different estimators in practice.
Throughout this book we have emphasized a representation of the g-formula that is the generalized version of
standardization (in the epidemiologic jargon). That is, the g-formula for a mean outcome is $\sum_{l} E[Y \mid A = a, L = l] f(l)$ for a
time-fixed treatment and, as described in this chapter, $\sum_{\bar l} E[Y \mid \bar A = \bar a, \bar L = \bar l] \prod_{k=0}^{K} f(l_k \mid \bar a_{k-1}, \bar l_{k-1})$ for a time-varying
treatment. Because a plug-in estimator based on this representation of the g-formula requires estimates of the joint
density of the confounders $\prod_{k=0}^{K} f(l_k \mid \bar a_{k-1}, \bar l_{k-1})$ over time, we refer to it as a joint density modeling estimator of the
g-formula.


An alternative representation of the g-formula is a conditional expectation. For a time-fixed treatment, we
implicitly used this g-formula representation $E[E[Y \mid A = a, L]]$ in Section 13.3. For a time-varying treatment, the
representation is an iterated conditional expectation (ICE) that can be recursively defined. A plug-in estimator based on
the ICE representation of the g-formula requires the fitting of sequential predictive algorithms (e.g., regression models).
The ICE estimator is described in Section 21.3 and Technical Point 21.3, where we combine it with the estimation of
IP weights to construct doubly (or $k + 1$) robust estimators.

Another equivalent representation of the g-formula is IP weighting. In fact, as shown in Technical Point 2.3 for
time-fixed treatments, the standardized mean and the IP weighted mean are equal under positivity. The same is true for
time-varying treatments (Robins, 1986; Young et al., 2014). As described in this chapter, an estimator based on the IP
weighting representation of the g-formula requires the estimation of the conditional density of treatment over time given
past treatment and covariate history. We refer to these estimators as IP weighted estimators rather than as g-formula
estimators.


structural nested mean model with two equations

For time $k = 0$: $E[Y^{a_0, a_1 = 0} - Y^{a_0 = 0, a_1 = 0}] = \beta_0 a_0$
For time $k = 1$: $E[Y^{a_0, a_1} - Y^{a_0, a_1 = 0} \mid L_1^{a_0} = l_1, A_0 = a_0] = a_1 (\beta_{11} + \beta_{12} l_1 + \beta_{13} a_0 + \beta_{14} a_0 l_1)$


The second equation models the effect of treatment at time $k = 1$ within
each of the 4 treatment and covariate histories defined by $(A_0, L_1)$. This
component of the model is saturated because the 4 parameters $\beta_1$ in the second
equation parameterize the effect of $a_1$ on $Y$ within the 4 possible levels of
past treatment and covariate history. The first equation models the effect of
treatment at time $k = 0$ when treatment at time $k = 1$ is set to zero. This
component of the model is also saturated because it has one parameter $\beta_0$ to
estimate the effect within the only possible history (there is no prior treatment
or covariates, so everybody has the same history).

Effect of $a_1$ when $a_0$ is set to 0:

• $\beta_{11}$ in individuals with $L_1^{a_0 = 0} = 0$
• $\beta_{11} + \beta_{12}$ in those with $L_1^{a_0 = 0} = 1$

Effect of $a_1$ when $a_0$ is set to 1:

• $\beta_{11} + \beta_{13}$ in those with $L_1^{a_0 = 1} = 0$
• $\beta_{11} + \beta_{13} + \beta_{14}$ in those with $L_1^{a_0 = 1} = 1$

The two equations of the structural nested model are the reason why the
model is referred to as nested. The first equation models the effect of receiving
treatment at time 0 and never again after time 0, the second equation models
the effect of receiving treatment at time 1 and never again after time 1, and
so on if we had more time points.

By consistency, $L_1^{A_0} = L_1$.

Let us use g-estimation to estimate the parameters of our structural nested
model with $K = 1$. We follow the same approach as in Chapter 14. We start
by considering the additive rank-preserving structural nested model for each
The proof can be found in Robins
(1994). Note that to fit an unsaturated
structural nested mean model
by g-estimation, positivity is not
required.


Table 21.2
A_0  L_1  A_1  Mean $H_1(\psi^\dagger)$
0    0    0    $84$
0    0    1    $84 - \psi^\dagger_{11}$
0    1    0    $52$
0    1    1    $52 - \psi^\dagger_{11} - \psi^\dagger_{12}$
1    0    0    $76$
1    0    1    $76 - \psi^\dagger_{11} - \psi^\dagger_{13}$
1    1    0    $44$
1    1    1    $44 - \psi^\dagger_{11} - \psi^\dagger_{12} - \psi^\dagger_{13} - \psi^\dagger_{14}$

Table 21.3
A_0  L_1  A_1  Mean $H_0(\psi^\dagger)$
0    0    0    $84$
0    0    1    $84$
0    1    0    $52$
0    1    1    $52$
1    0    0    $76 - \psi^\dagger_0$
1    0    1    $76 - \psi^\dagger_0$
1    1    0    $44 - \psi^\dagger_0$
1    1    1    $44 - \psi^\dagger_0$
individual $i$

$$Y_i^{a_0, 0} = Y_i^{0,0} + \psi_0 a_0$$
$$Y_i^{a_0, a_1} = Y_i^{a_0, 0} + \psi_{11} a_1 + \psi_{12} a_1 L_{1,i}^{a_0} + \psi_{13} a_1 a_0 + \psi_{14} a_1 a_0 L_{1,i}^{a_0}$$

(We represent $Y_i^{a_0 = 0, a_1 = 0}$ by $Y_i^{0,0}$ to simplify the notation.)
The first equation is a rank-preserving model because the effect $\psi_0$ is exactly
the same for every individual. Thus if $Y_i^{0,0}$ for subject $i$ exceeds $Y_j^{0,0}$ for subject
$j$, the same ranking of $i$ and $j$ will hold for $Y^{1,0}$: the model preserves ranks
across strategies. Also, under equation 2, if $Y_i^{a_0,0}$ for subject $i$ exceeds $Y_j^{a_0,0}$ for
subject $j$, we can only be certain that $Y_i^{a_0,1}$ for subject $i$ also exceeds $Y_j^{a_0,1}$ for
subject $j$ if both have the same values of $L_1^{a_0}$. Because the preservation of
the ranking is conditional on local factors (i.e., the value $L_1^{a_0}$), we refer to
the second equation as a conditionally, or locally, rank-preserving model.

As discussed in Chapter 14, rank preservation is biologically implausible 
because of individual heterogeneity in unmeasured genetic and environmen- 
tal risks. That is why our primary interest is in the structural nested mean 
model, which is totally agnostic as to whether or not there is effect hetero- 
geneity across individuals. However, provided the strengthened identifiability 
conditions hold, g-estimates of $\psi$ from the rank-preserving model are consistent
for the parameters $\beta$ of the mean model, even if the rank-preserving model is
misspecified.

The first step in g-estimation is linking the model to the observed data,
as we did in Chapter 14 for a time-fixed treatment. To do so, note that, by
consistency, the counterfactual outcome $Y^{a_0, a_1}$ is equal to the observed
outcome $Y$ for individuals who happen to be treated with treatment values $a_0$ and
$a_1$. Formally, $Y^{a_0, a_1} = Y^{A_0, A_1} = Y$ for individuals with $(A_0 = a_0, A_1 = a_1)$.
Similarly $Y^{a_0, 0} = Y^{A_0, 0}$ for individuals with $(A_0 = a_0, A_1 = 0)$, and $L_1^{a_0} = L_1$
for individuals with $A_0 = a_0$. Now we can rewrite the structural nested model
in terms of the observed data as

$$Y^{A_0, 0} = Y - (\psi_{11} A_1 + \psi_{12} A_1 L_1 + \psi_{13} A_1 A_0 + \psi_{14} A_1 A_0 L_1)$$
$$Y^{0,0} = Y^{A_0, 0} - \psi_0 A_0$$

(we are omitting the individual index $i$ to simplify the notation).

The second step in g-estimation is to use the observed data to compute
the candidate counterfactuals $H_1(\psi^\dagger)$ and $H_0(\psi^\dagger)$. To do so, we use the
structural nested model with the true value $\psi$ of the parameter replaced by
some value $\psi^\dagger$:

$$H_1(\psi^\dagger) = Y - (\psi_{11}^\dagger A_1 + \psi_{12}^\dagger A_1 L_1 + \psi_{13}^\dagger A_1 A_0 + \psi_{14}^\dagger A_1 A_0 L_1)$$
$$H_0(\psi^\dagger) = H_1(\psi^\dagger) - \psi_0^\dagger A_0$$

As in Chapter 14, the goal is to find the value $\psi^\dagger$ of the parameters that is equal
to the true value $\psi$. When $\psi^\dagger = \psi$, the candidate counterfactual $H_1(\psi^\dagger)$ equals
the true counterfactual $Y^{A_0,0}$. We can now use sequential exchangeability
to conduct g-estimation at each time point. Fine Point 21.3 describes how to
g-estimate the parameters $\psi$ of our saturated structural nested model. It turns
out that all parameters of the structural nested model are 0, which implies
that all counterfactual means $E[Y^g]$ under any static or dynamic strategy $g$
are equal to 60. This result agrees with those obtained by the g-formula and
by IP weighting. G-estimation, like the g-formula and IP weighting, succeeds
where traditional methods failed.
Fine Point 21.3


G-estimation with a saturated structural nested model. Sequential exchangeability at $k = 1$ implies that, within
any of the four joint strata of $(A_0, L_1)$, the mean of $Y^{A_0,0}$ among individuals with $A_1 = 1$ is equal to the mean among
individuals with $A_1 = 0$. Therefore, the means of $H_1(\psi^\dagger)$ must also be equal when $\psi^\dagger = \psi$.

Consider first the stratum $(A_0, L_1) = (0,0)$. From data rows 1 and 2 in Table 21.2, we find that the mean of
$H_1(\psi)$ is 84 when $A_1 = 0$ and $84 - \psi_{11}$ when $A_1 = 1$. Hence $\psi_{11} = 0$. Next we equate the means of $H_1(\psi)$ in
data rows 3 and 4, corresponding to stratum $(A_0, L_1) = (0,1)$, to obtain $52 = 52 - \psi_{11} - \psi_{12}$. Since $\psi_{11} = 0$, we
conclude $\psi_{12} = 0$. Continuing, we equate the means of $H_1(\psi)$ in data rows 5 and 6 to obtain $76 = 76 - \psi_{11} - \psi_{13}$.
Since $\psi_{11} = \psi_{12} = 0$, we conclude $\psi_{13} = 0$. Finally, equating the means of $H_1(\psi)$ in data rows 7 and 8, we obtain
$44 = 44 - \psi_{11} - \psi_{12} - \psi_{13} - \psi_{14}$, so $\psi_{14} = 0$ as well.

To estimate $\psi_0$, we first substitute the values $\psi_{11}$, $\psi_{12}$, $\psi_{13}$, and $\psi_{14}$ into the expression for the mean of $H_1(\psi)$ in
Table 21.2. In this example, all parameters were equal to 0, so the mean of $H_1(\psi)$ was equal to the mean of the observed
outcome $Y$. We then use the first equation of the structural nested model to compute the mean of $H_0(\psi)$ for each
data row in the table by subtracting $\psi_0 A_0$ from the mean of $H_1(\psi)$, as shown in Table 21.3. Sequential exchangeability
$Y^{0,0} \perp\!\!\!\perp A_0$ at time $k = 0$ implies that the means of $H_0(\psi)$ among the 16,000 subjects with $A_0 = 1$ and the 16,000
subjects with $A_0 = 0$ are identical. The mean of $H_0(\psi)$ is $84 \times 0.25 + 52 \times 0.75 = 60$ among individuals with $A_0 = 0$, and
$(76 - \psi_0) \times 0.5 + (44 - \psi_0) \times 0.5 = 60 - \psi_0$ among individuals with $A_0 = 1$. Hence $\psi_0 = 0$. We have completed
g-estimation.
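
The arithmetic of this Fine Point can be retraced in a few lines of Python. The sketch below is only an illustration: it uses the stratum means (84, 52, 76, 44) and the covariate proportions (0.25/0.75 and 0.5/0.5) quoted above, while the function and variable names are hypothetical choices made here.

```python
# Stratum means of Y by (A0, L1, A1), as quoted in Tables 21.2 and 21.3.
MEAN_Y = {
    (0, 0, 0): 84.0, (0, 0, 1): 84.0,
    (0, 1, 0): 52.0, (0, 1, 1): 52.0,
    (1, 0, 0): 76.0, (1, 0, 1): 76.0,
    (1, 1, 0): 44.0, (1, 1, 1): 44.0,
}
# Pr[L1 = 1 | A0], from the 0.25/0.75 and 0.5/0.5 splits quoted in the text.
PR_L1 = {0: 0.75, 1: 0.5}


def blip_at_time_1(a0, l1):
    """Total effect of A1 in stratum (A0, L1): the value that equalises mean H1."""
    return MEAN_Y[(a0, l1, 1)] - MEAN_Y[(a0, l1, 0)]


def mean_h1(a0):
    """Mean of H1 given A0 once the A1 effect is removed.

    After removal, the stratum mean of H1 equals the mean of Y among A1 = 0.
    """
    total = 0.0
    for l1, pr in ((0, 1.0 - PR_L1[a0]), (1, PR_L1[a0])):
        total += pr * MEAN_Y[(a0, l1, 0)]
    return total


def g_estimate_saturated():
    # Time k = 1: solve the four stratum equations in turn.
    psi11 = blip_at_time_1(0, 0)
    psi12 = blip_at_time_1(0, 1) - psi11
    psi13 = blip_at_time_1(1, 0) - psi11
    psi14 = blip_at_time_1(1, 1) - psi11 - psi12 - psi13
    # Time k = 0: mean H0 = mean H1 - psi0 * A0 must not depend on A0.
    psi0 = mean_h1(1) - mean_h1(0)
    estimates = {"psi0": psi0, "psi11": psi11, "psi12": psi12,
                 "psi13": psi13, "psi14": psi14}
    return estimates, mean_h1(0)  # second value estimates E[Y^{0,0}]


if __name__ == "__main__":
    print(g_estimate_saturated())
```

Because the model is saturated, each equation has a single unknown and no search is needed. Unsaturated models generally require the search or closed-form approaches described later in this section.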





In practice, however, we will encounter observational studies with multiple
times $k$ and multiple covariates $L_k$ at each time. In general, a structural nested
mean model has as many equations as time points $k = 0, 1, \ldots, K$. The general
form of structural nested mean models is therefore the following: For each time
$k = 0, 1, \ldots, K$,

$$E\left[Y^{\bar{a}_{k-1}, a_k, \underline{0}_{k+1}} - Y^{\bar{a}_{k-1}, \underline{0}_k} \mid \bar{L}_k = \bar{l}_k, \bar{A}_{k-1} = \bar{a}_{k-1}\right] = a_k \gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$$

where $(\bar{a}_{k-1}, a_k, \underline{0}_{k+1})$ is a static strategy that assigns treatment $\bar{a}_{k-1}$ between
times 0 and $k-1$, treatment $a_k$ at time $k$, and treatment 0 from time $k+1$ until
the end of follow-up $K$. The strategies $(\bar{a}_{k-1}, a_k, \underline{0}_{k+1})$ and $(\bar{a}_{k-1}, \underline{0}_k)$ differ
only in that the former has treatment $a_k$ at $k$ while the latter has treatment 0
at time $k$.

The function $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$ satisfies
$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, 0) = 0$ so $\beta = 0$
under the null hypothesis of no effect
of treatment.

That is, a structural nested mean model is a model for the effect on the
mean of $Y$ of a last blip of treatment of magnitude $a_k$ at $k$, as a function
$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$ of past treatment and covariate history $(\bar{a}_{k-1}, \bar{l}_k)$. See Technical
Point 21.4 for the relationship between structural nested models and
marginal structural models.
In our example with $K = 1$, $\gamma_0(\bar{a}_{-1}, l_0, \beta)$ is just $\beta_0$ ($l_0$ and $\bar{a}_{-1}$ can both
be taken to be identically 0) and $\gamma_1(a_0, l_1, \beta)$ is $\beta_{11} + \beta_{12} l_1 + \beta_{13} a_0 + \beta_{14} a_0 l_1$.
The candidate counterfactuals for models with several time points $k$ can be
compactly defined as

$$H_k(\psi^\dagger) = Y - \sum_{j=k}^{K} A_j \gamma_j(\bar{A}_{j-1}, \bar{L}_j, \psi^\dagger)$$


With multiple time points or covariates, we will need to fit an unsaturated
structural nested mean model. For example, we might hypothesize that the
function $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$ is the same for all $k$. The simplest model would be
$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \beta_1$, which assumes that the effect of a last blip of treatment
Technical Point 21.4


Relation between marginal structural models and structural nested models (Part II). We can now generalize the
results in Fine Point 14.1 to time-varying treatments. A structural nested mean model is a semiparametric marginal
structural mean model if and only if, for all $(\bar{a}_{k-1}, \bar{l}_k, \beta)$,

$$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \gamma_k(\bar{a}_{k-1}, \beta)$$

Specifically, it is a semiparametric marginal structural mean model with the functional form

$$E[Y^{\bar{a}}] = E[Y^{\bar{0}}] + \sum_{k=0}^{K} a_k \gamma_k(\bar{a}_{k-1}, \beta),$$

where $E[Y^{\bar{0}}]$ is left unspecified. However, such a structural nested mean model is not simply a marginal structural
mean model, because it also imposes the additional strong assumption that effect modification by past covariate history
is absent. In contrast, a marginal structural model is agnostic as to whether there is effect modification by time-varying
covariates.

If we specify a structural nested mean model $\gamma_k(\bar{a}_{k-1}, \beta)$, then we can estimate $\beta$ either by g-estimation or IP
weighting. However, the most efficient g-estimator will be more efficient than the most efficient IP weighted estimator
when the structural nested mean model (and thus the marginal structural mean model) is correctly specified, because
g-estimation uses the additional assumption of no effect modification by past covariates to increase efficiency.

In contrast, suppose the marginal structural mean model is correct but the structural nested mean model is incorrect
because $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) \neq \gamma_k(\bar{a}_{k-1}, \beta)$. Then the g-estimates of $\beta$ and $E[Y^{\bar{a}}]$ will be biased, while the IP weighted
estimates remain unbiased. Thus we have a classic variance-bias trade-off. Given the marginal structural model,
g-estimation can increase efficiency if $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \gamma_k(\bar{a}_{k-1}, \beta)$, but introduces bias otherwise.





is the same for all times $k$. Other options are $\beta_1 + \beta_2 k$, which assumes that the
effect varies linearly with the time $k$ of treatment, and $\beta_1 + \beta_2 k + \beta_3 a_{k-1} +
\beta_4 l_k + \beta_5 l_k a_{k-1}$, which allows the effect to be modified by past treatment and
covariate history.

To describe g-estimation for structural nested mean models with multiple
time points, suppose the nonsaturated model is $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \beta_1$. The
corresponding rank-preserving model entails

$$H_k(\psi^\dagger) = Y - \sum_{j=k}^{K} A_j \psi^\dagger,$$

which can be computed from the observed data for any value $\psi^\dagger$. We will then choose
values $\psi_{low}$ and $\psi_{up}$ that are much smaller and larger, respectively, than any
substantively plausible value of $\psi$, and will compute for each individual the
value of $H_k(\psi^\dagger)$ for each $\psi^\dagger$ on a grid from $\psi_{low}$ to $\psi_{up}$, say $\psi_{low}$, $\psi_{low} +
0.1$, $\psi_{low} + 0.2$, ..., $\psi_{up}$.
Then, for each value of $\psi^\dagger$, we will fit a pooled logistic regression model

$$\text{logit} \Pr[A_k = 1 \mid H_k(\psi^\dagger), \bar{L}_k, \bar{A}_{k-1}] = \alpha_0 + \alpha_1 H_k(\psi^\dagger) + \alpha_2 W_k$$

for the probability of treatment at time $k$ for $k = 0, \ldots, K$. Here $W_k =
w_k(\bar{L}_k, \bar{A}_{k-1})$ is a vector of covariates calculated from an individual's covariate
and treatment data $(\bar{L}_k, \bar{A}_{k-1})$, $\alpha_2$ is a row vector of unknown parameters,
and each person contributes $K + 1$ observations. Under the assumptions of
sequential exchangeability and consistency, the g-estimate of $\psi$, and therefore
of $\beta_1$, is the value $\psi^\dagger$ that results in the estimate of $\alpha_1$ that is closest to 0.
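
The grid search can be sketched in Python for a two-period example ($K = 1$). The sketch rests on several assumptions that are not part of the text: the counts per data row, the constant effect of 5 added to each treated period, and all function names are hypothetical. Instead of refitting a logistic regression at every grid point, it evaluates the score for $\alpha_1$ at $\alpha_1 = 0$, namely the sum over person-times of $(A_k - \hat{p}_k) H_k(\psi^\dagger)$, where $\hat{p}_k$ comes from a saturated treatment model. The grid value whose score is closest to 0 is taken as the g-estimate.

```python
# Aggregated two-period data: (A0, L1, A1, count, outcome mean).
# The counts and the added constant effect are illustrative assumptions.
BASE_ROWS = [
    (0, 0, 0, 2400, 84.0), (0, 0, 1, 1600, 84.0),
    (0, 1, 0, 2400, 52.0), (0, 1, 1, 9600, 52.0),
    (1, 0, 0, 4800, 76.0), (1, 0, 1, 3200, 76.0),
    (1, 1, 0, 1600, 44.0), (1, 1, 1, 6400, 44.0),
]
TRUE_BLIP = 5.0  # hypothetical effect of each treated period
DATA = [(a0, l1, a1, n, y + TRUE_BLIP * (a0 + a1))
        for a0, l1, a1, n, y in BASE_ROWS]


def person_time(data):
    """Yield pooled records (k, W_k, A_k, sum of A_j for j >= k, count, Y)."""
    for a0, l1, a1, n, y in data:
        yield 0, (), a0, a0 + a1, n, y          # k = 0: no past history
        yield 1, (a0, l1), a1, a1, n, y         # k = 1: history (A0, L1)


def fitted_treatment_probs(data):
    """Saturated treatment model: proportion treated within each (k, W_k)."""
    treated, total = {}, {}
    for k, w, a, _, n, _ in person_time(data):
        total[(k, w)] = total.get((k, w), 0) + n
        treated[(k, w)] = treated.get((k, w), 0) + n * a
    return {key: treated[key] / total[key] for key in total}


def score(psi, data, probs):
    """Score for alpha_1 at alpha_1 = 0: sum of (A_k - p_k) * H_k(psi)."""
    s = 0.0
    for k, w, a, later_a, n, y in person_time(data):
        h_k = y - psi * later_a                 # H_k(psi) = Y - psi * sum_{j>=k} A_j
        s += n * (a - probs[(k, w)]) * h_k
    return s


def g_estimate(data, psi_low=0.0, psi_up=10.0, step=0.1):
    """Return the grid value of psi whose score is closest to zero."""
    probs = fitted_treatment_probs(data)
    n_points = int(round((psi_up - psi_low) / step)) + 1
    grid = [psi_low + i * step for i in range(n_points)]
    return min(grid, key=lambda psi: abs(score(psi, data, probs)))


if __name__ == "__main__":
    print(round(g_estimate(DATA), 1))
```

With real data the treatment model would usually be a parametric pooled logistic regression rather than a saturated one, and the grid limits would be chosen from subject-matter knowledge.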
The procedure described above is the generalization to time-varying treatments
of the g-estimation procedure described in Chapter 14. For simplicity, we
considered a structural nested model with a single parameter $\beta_1$, which implies
that the effect does not vary over time $k$ or by treatment and covariate history.

The limits of the 95% confidence
interval for $\psi$ are the limits of the
set of values $\psi^\dagger$ that result in a P-value
> 0.05 when testing for $\alpha_1 = 0$.

Suppose now that the parameter $\beta$ is a vector. To be concrete, suppose we
consider the model with $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \beta_0 + \beta_1 k + \beta_2 a_{k-1} + \beta_3 l_k + \beta_4 l_k a_{k-1}$, so
$\beta$ is 5-dimensional and $l_k$ is 1-dimensional. Now to estimate 5 parameters one
requires 5 additional covariates in the treatment model. For example, we could
fit the model

$$\text{logit} \Pr[A_k = 1 \mid H_k(\psi^\dagger), \bar{L}_k, \bar{A}_{k-1}] = \alpha_0 + H_k(\psi^\dagger)(\alpha_1 + \alpha_2 k + \alpha_3 A_{k-1} + \alpha_4 L_k + \alpha_5 L_k A_{k-1}) + \alpha_6 W_k$$

The particular choice of covariates does not affect the consistency of the point
estimate of $\beta$, but it determines the width of its confidence interval.

Technical Point 21.5


A closed form estimator for linear structural nested mean models. When, as in all the examples we have discussed,
$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \beta^T R_k$ is linear in $\beta$ with $R_k = r_k(\bar{L}_k, \bar{A}_{k-1})$ being a vector of known functions, then, given the
model $\text{logit} \Pr[A_k = 1 \mid \bar{L}_k, \bar{A}_{k-1}] = \alpha^T W_k$, there is an explicit closed form expression for $\hat{\beta}$ given by

$$\hat{\beta} = \left\{ \sum_{i=1,k=0}^{i=N,k=K} X_{i,k}(\hat{\alpha}) Q_{i,k} S_{i,k}^T \right\}^{-1} \left\{ \sum_{i=1,k=0}^{i=N,k=K} Y_i X_{i,k}(\hat{\alpha}) Q_{i,k} \right\}$$

with $X_{i,k}(\hat{\alpha}) = [A_{i,k} - \text{expit}(\hat{\alpha}^T W_{i,k})]$, $S_{i,k} = \sum_{j=k}^{j=K} A_{i,j} R_{i,j}$, so that $H_{i,k}(\beta) = Y_i - \beta^T S_{i,k}$, and the choice of
$Q_{i,k} = q_k(\bar{L}_{i,k}, \bar{A}_{i,k-1})$ affects efficiency but not consistency. See Robins (1994) for the optimal choice of $Q_k$.
In fact, when $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$ is linear in $\beta$, we can obtain a closed-form doubly robust estimator $\tilde{\beta}$ of $\beta$ by specifying
a working model $\varsigma^T D_k = \varsigma^T d_k(\bar{L}_k, \bar{A}_{k-1})$ for $E[H_k(\beta) \mid \bar{L}_k, \bar{A}_{k-1}] = E[Y^{\bar{A}_{k-1}, \underline{0}_k} \mid \bar{L}_k, \bar{A}_{k-1}]$ and defining

$$\begin{pmatrix} \tilde{\beta} \\ \tilde{\varsigma} \end{pmatrix} = \left\{ \sum_{i=1,k=0}^{i=N,k=K} \begin{pmatrix} X_{i,k}(\hat{\alpha}) Q_{i,k} \\ D_{i,k} \end{pmatrix} \begin{pmatrix} S_{i,k}^T & D_{i,k}^T \end{pmatrix} \right\}^{-1} \left\{ \sum_{i=1,k=0}^{i=N,k=K} Y_i \begin{pmatrix} X_{i,k}(\hat{\alpha}) Q_{i,k} \\ D_{i,k} \end{pmatrix} \right\}$$

Specifically, $\tilde{\beta}$ will be a consistent, asymptotically normal estimator of $\beta$ if either
the model $\varsigma^T D_k$ for $E[Y^{\bar{A}_{k-1}, \underline{0}_k} \mid \bar{L}_k, \bar{A}_{k-1}]$ is correct or the model for
$\text{logit} \Pr[A_k = 1 \mid \bar{L}_k, \bar{A}_{k-1}]$ is correct.
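
For a linear structural nested mean model, the closed-form g-estimator of Technical Point 21.5 amounts to solving a small linear system. The following Python sketch illustrates this for a hypothetical two-parameter model $\gamma_k = \beta_0 + \beta_1 k$ with $K = 1$, so that $R_k = (1, k)$. The data counts, the true values $\beta = (5, -2)$, the choice $Q_k = R_k$, and the saturated treatment model (used in place of a fitted logistic model) are all assumptions made for this example. The estimator solves $\sum X_{i,k} Q_{i,k} (Y_i - \beta^T S_{i,k}) = 0$ with $S_{i,k} = \sum_{j \geq k} A_{i,j} R_{i,j}$.

```python
# Aggregated two-period data: (A0, L1, A1, count, outcome mean).
# Counts and true parameter values are illustrative assumptions.
BASE_ROWS = [
    (0, 0, 0, 2400, 84.0), (0, 0, 1, 1600, 84.0),
    (0, 1, 0, 2400, 52.0), (0, 1, 1, 9600, 52.0),
    (1, 0, 0, 4800, 76.0), (1, 0, 1, 3200, 76.0),
    (1, 1, 0, 1600, 44.0), (1, 1, 1, 6400, 44.0),
]
TRUE_BETA = (5.0, -2.0)  # hypothetical: gamma_k = beta0 + beta1 * k


def blip(k, beta):
    return beta[0] + beta[1] * k


DATA = [(a0, l1, a1, n, y + a0 * blip(0, TRUE_BETA) + a1 * blip(1, TRUE_BETA))
        for a0, l1, a1, n, y in BASE_ROWS]


def person_time(data):
    """Yield (k, W_k, A_k, R_k, S_k, count, Y) with S_k = sum_{j>=k} A_j R_j."""
    for a0, l1, a1, n, y in data:
        yield 0, (), a0, (1.0, 0.0), (a0 + a1, a1), n, y
        yield 1, (a0, l1), a1, (1.0, 1.0), (a1, a1), n, y


def closed_form_estimate(data):
    """Solve {sum X Q S^T} beta = {sum Y X Q} for a 2-dimensional beta."""
    treated, total = {}, {}
    for k, w, a, _, _, n, _ in person_time(data):
        total[(k, w)] = total.get((k, w), 0) + n
        treated[(k, w)] = treated.get((k, w), 0) + n * a
    m = [[0.0, 0.0], [0.0, 0.0]]   # accumulates sum of X * Q * S^T
    v = [0.0, 0.0]                 # accumulates sum of Y * X * Q
    for k, w, a, r, s, n, y in person_time(data):
        x = a - treated[(k, w)] / total[(k, w)]   # A_k minus fitted probability
        q = r                                     # Q_k = R_k (efficiency only)
        for row in range(2):
            v[row] += n * y * x * q[row]
            for col in range(2):
                m[row][col] += n * x * q[row] * s[col]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    beta0 = (v[0] * m[1][1] - m[0][1] * v[1]) / det   # Cramer's rule
    beta1 = (m[0][0] * v[1] - v[0] * m[1][0]) / det
    return beta0, beta1


if __name__ == "__main__":
    print(closed_form_estimate(DATA))
```

No grid is involved: the cost is one pass over the person-time data plus the solution of a system whose size equals the dimension of $\beta$.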
The g-estimation procedure then requires a search over a 5-dimensional
grid, one dimension for each component $\beta_j$ of $\beta$. So if we had 20 grid points
for each component we would have $20^5$ (3.2 million) different values of $\beta$ on our
5-dimensional grid. The g-estimate $\hat{\beta}$ is the $\beta$ for which the 5 degree-of-freedom score
statistic for testing that all 5 of $(\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5)$ are zero is closest to zero. However, when the
dimension of $\beta$ is greater than 2, finding the g-estimate $\hat{\beta}$ by a grid search may be
computationally prohibitive. Fortunately, there is a closed form estimator of
$\beta$ that does not require a grid search when, as in all examples discussed in this
section, the structural nested mean model is linear. See Technical Point 21.5,
which also describes a doubly robust form of the estimator.

A 95% joint confidence interval for
$\beta$ is the set of values for which
the 5 degree-of-freedom score test
does not reject at the 5% level.
A less computationally demanding
approach is to compute a univariate
95% Wald confidence interval
as $\hat{\beta}_j \pm 1.96$ times its standard error.

After g-estimation of the parameters of the structural nested mean model,
the last step is the estimation of the counterfactual mean $E[Y^g]$ under the
strategies $g$ of interest. If there is no effect modification by past covariate
Technical Point 21.6


Estimation of $E[Y^g]$ after g-estimation of a structural nested mean model. Suppose the identifiability assumptions
hold, one has obtained a doubly robust g-estimate $\tilde{\beta}$ of a structural nested mean model $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$, and one wishes
to estimate $E[Y^g]$ under a dynamic strategy $g$. To do so, one can use the following steps of a Monte Carlo algorithm:


1. Estimate the mean response $E[Y^{\bar{0}}]$ had treatment always been withheld by the sample average of $H_0(\tilde{\beta})$ over
the $N$ study subjects. Call the estimate $\hat{E}[Y^{\bar{0}}]$.


2. Fit a parametric model for $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$ to the data, pooled over persons and times, and let $\hat{f}(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$
denote the estimate of $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$ under the model.


3. Do for $v = 1, \ldots, V$,


(a) Draw $l_{v,0}$ from $\hat{f}(l_0)$.


(b) Recursively for $k = 1, \ldots, K$ draw $l_{v,k}$ from $\hat{f}(l_k \mid \bar{a}_{v,k-1}, \bar{l}_{v,k-1})$ with $\bar{a}_{v,k-1} = \bar{g}_{k-1}(\bar{l}_{v,k-1})$, the treatment
history corresponding to the strategy $g$.

(c) Let $\hat{\Delta}_{g,v} = \sum_{j=0}^{j=K} a_{v,j} \gamma_j(\bar{a}_{v,j-1}, \bar{l}_{v,j}, \tilde{\beta})$ be the $v$th Monte Carlo estimate of $Y^g - Y^{\bar{0}}$, where $a_{v,j} = g_j(\bar{l}_{v,j})$.


4. Let $\hat{E}[Y^g] = \hat{E}[Y^{\bar{0}}] + \sum_{v=1}^{v=V} \hat{\Delta}_{g,v} / V$ be the estimate of $E[Y^g]$.


If the model for $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$, the structural nested mean model $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta)$, and either the treatment model
$\Pr[A_k = 1 \mid \bar{L}_k, \bar{A}_{k-1}]$ or the outcome model $E[Y^{\bar{A}_{k-1}, \underline{0}_k} \mid \bar{L}_k, \bar{A}_{k-1}]$ are correctly specified, then $\hat{E}[Y^g]$ is consistent
for $E[Y^g]$. Confidence intervals can be obtained using the nonparametric bootstrap.


Note that $\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \tilde{\beta})$ will converge to 0 if the estimate $\tilde{\beta}$ is consistent for $\beta = 0$. Thus $\hat{\Delta}_{g,v}$ will converge
to zero and $\hat{E}[Y^g]$ to $\hat{E}[Y^{\bar{0}}]$ even if the model for $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1})$ is incorrect. That is, the structural nested
mean model preserves the null if the identifiability conditions hold and we either know (as in a sequentially randomized
experiment) $\Pr[A_k = 1 \mid \bar{L}_k, \bar{A}_{k-1}]$ or have a correct model for either $\Pr[A_k = 1 \mid \bar{L}_k, \bar{A}_{k-1}]$ or $E[Y^{\bar{A}_{k-1}, \underline{0}_k} \mid \bar{L}_k, \bar{A}_{k-1}]$.
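
Steps 3 and 4 of the Monte Carlo algorithm can be sketched in Python as follows. This is a minimal illustration, not a full implementation: the estimate of $E[Y^{\bar{0}}]$, the fitted blip function, the covariate model, and the strategy are passed in as plain functions, and the example inputs (a binary covariate with made-up probabilities, a constant blip of 2, and the strategy "treat when $L_k = 1$") are hypothetical.

```python
import random


def monte_carlo_mean(e_y0, gamma, strategy, draw_l, K, V, seed=0):
    """Estimate E[Y^g] as e_y0 plus the average simulated sum of blips.

    gamma(k, a_hist, l_hist): fitted blip function; a_hist holds treatments
        before time k and l_hist holds covariates through time k.
    strategy(k, a_hist, l_hist): treatment assigned at time k under g.
    draw_l(rng, k, a_hist, l_hist): draw of L_k given the history before k.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(V):
        a_hist, l_hist, delta = [], [], 0.0
        for k in range(K + 1):
            l_hist.append(draw_l(rng, k, a_hist, l_hist))  # steps 3(a), 3(b)
            a_k = strategy(k, a_hist, l_hist)
            delta += a_k * gamma(k, a_hist, l_hist)         # step 3(c)
            a_hist.append(a_k)
        total += delta
    return e_y0 + total / V                                 # step 4


# Hypothetical inputs, for illustration only.
def draw_binary_l(rng, k, a_hist, l_hist):
    if k == 0:
        p = 0.5
    else:
        p = 0.3 if a_hist[-1] == 1 else 0.6
    return 1 if rng.random() < p else 0


def treat_if_l(k, a_hist, l_hist):
    return l_hist[-1]          # dynamic strategy: treat at k when L_k = 1


def constant_blip(k, a_hist, l_hist):
    return 2.0


if __name__ == "__main__":
    print(monte_carlo_mean(60.0, constant_blip, treat_if_l,
                           draw_binary_l, K=1, V=10000))
```

When the blip function is identically zero the simulated covariates play no role and the function returns `e_y0` unchanged, which mirrors the null-preservation property noted above.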





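history, e.g.,

$$\gamma_k(\bar{a}_{k-1}, \bar{l}_k, \beta) = \gamma_k(\bar{a}_{k-1}, \beta) = \beta_1 + \beta_2 k + \beta_3 a_{k-1} + \beta_4 a_{k-2} + \beta_5 a_{k-1} a_{k-2}$$

then $E[Y^{\bar{a}}]$ under a static strategy $\bar{a}$ is estimated as

$$\hat{E}[Y^{\bar{a}}] = \hat{E}[Y^{\bar{0}}] + \sum_{k=0}^{K} a_k \gamma_k(\bar{a}_{k-1}, \hat{\beta})$$

On the other hand, if the structural nested mean model includes terms for $L_k$
or we want to estimate $E[Y^g]$ under a dynamic strategy $g$, then we need to
simulate the $L_k$ using the algorithm described in Technical Point 21.6.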
21.5 Censoring is a time-varying treatment


You may want to re-read Section 
12.6 for a refresher on censoring. 


Conditioning on being uncensored 
(C = 0) induces selection bias un- 
der the null when C is either a col- 
lider on a pathway between treat- 
ment A and the outcome Y, or the 
descendant of one such collider. 


The use of the superscript $\bar{c} = \bar{0}$
makes explicit the causal contrast
that many have in mind when they
refer to the causal effect of treatment
$A$, even if they choose not to
use the superscript $\bar{c} = \bar{0}$.


Throughout this chapter we have used an example in which there is no cen- 
soring: the outcomes of all individuals in Table 21.1 are known. In practice, 
however, we will often encounter situations in which some individuals are lost to 
follow-up and therefore their outcome values are unknown or (right-)censored. 
We have discussed censoring and methods to handle it in Part II of the book. 
In Chapter 8, we showed that censoring may introduce selection bias, even 
under the null. In Chapter 12, we discussed how we are generally interested in 
the causal effect if nobody in the study population had been censored. 

However, in Part II we only considered a greatly simplified version of censoring
under which we did not specify when individuals were censored during
the follow-up. That is, we considered censoring $C$ as a time-fixed variable. A
more realistic view of censoring is as a time-varying variable $C_1, C_2, \ldots, C_{K+1}$,
where $C_m$ is an indicator that takes value 0 if the individual remains uncensored
at time $m$ and takes value 1 otherwise. Censoring is a monotonic type
of missing data, that is, if an individual's $C_m = 0$ then all previous censoring
indicators are also zero ($C_1 = 0, C_2 = 0, \ldots, C_m = 0$). Also, by definition, $C_0 = 0$
for all individuals in a study; otherwise they would not have been included in
the study.

If an individual is censored at time $m$, i.e., when $C_m = 1$, then treatments,
confounders, and outcomes measured after time $m$ are unobserved. Therefore,
the analysis becomes necessarily restricted to uncensored person-times, i.e.,
those with $C_m = 0$. For example, the g-formula for the counterfactual mean
outcome $E[Y^{\bar{a}}]$ from Section 21.1 needs to be rewritten as

$$\sum_{\bar{l}} E[Y \mid \bar{A} = \bar{a}, \bar{C} = \bar{0}, \bar{L} = \bar{l}] \prod_{k=0}^{K} f(l_k \mid \bar{a}_{k-1}, c_{k-1} = 0, \bar{l}_{k-1}),$$

with all the terms being conditional on remaining uncensored.

Suppose the identifiability conditions hold with treatment $A_m$ replaced by
$(A_m, C_m)$ at all times $m$. Then it is easy to show that the above expression
corresponds to the g-formula for the counterfactual mean outcome $E[Y^{\bar{a},\bar{c}=\bar{0}}]$
under the joint treatment $(\bar{a}, \bar{c} = \bar{0})$, that is, the mean outcome that would
have been observed if all individuals had received treatment strategy $\bar{a}$ and
no individual had been lost to follow-up.

The counterfactual mean $E[Y^{\bar{a},\bar{c}=\bar{0}}]$ can also be estimated via IP weighting
of a structural mean model when the identifiability conditions hold for the
joint treatment $(\bar{A}, \bar{C})$. To estimate this mean, we might fit, for example, the
outcome regression model

$$E[Y \mid \bar{A}, \bar{C} = \bar{0}] = \theta_0 + \theta_1 \text{cum}(\bar{A})$$

to the pseudo-population created by the nonstabilized IP weights $W^{\bar{A}} \times W^{\bar{C}}$,
where

$$W^{\bar{C}} = \prod_{k=1}^{K+1} \frac{1}{\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0, \bar{L}_k)}$$

We estimate the denominator of the weights by fitting a logistic regression
model for $\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0, \bar{L}_k)$.

In the pseudo-population created by the nonstabilized IP weights, the
censored individuals are replaced by copies of uncensored individuals with
the same values of treatment and covariate history. Therefore the pseudo-population
has the same size as the original study population before censoring,
that is, before any losses to follow-up occur. The nonstabilized IP weights
abolish censoring in the pseudo-population.

Remember:
The estimated IP weights $SW^{\bar{C}}$
have mean 1 when the model for
$\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0, \bar{L}_k)$
is correctly specified.

Or we can use the pseudo-population created by the stabilized IP weights
$SW^{\bar{A}} \times SW^{\bar{C}}$, where

$$SW^{\bar{C}} = \prod_{k=1}^{K+1} \frac{\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0)}{\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0, \bar{L}_k)}$$

We estimate the denominator and numerator of the IP weights via two separate
models for $\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0, \bar{L}_k)$ and $\Pr(C_k = 0 \mid \bar{A}_{k-1}, C_{k-1} = 0)$,
respectively.

The pseudo-population created by the stabilized IP weights is of the same
size as the original study population after censoring, that is, the proportion
of censored individuals in the pseudo-population is identical to that in the
study population at each time $k$. The stabilized weights do not eliminate
censoring in the pseudo-population; they make censoring occur at random at
each time $k$ with respect to the measured covariate history $\bar{L}_k$. That is, there
is selection but no selection bias. Regardless of the type of IP weights used,
in the pseudo-population there are no arrows from $\bar{L}_k$ into future $C_m$ for $m >
k$. Importantly, under the exchangeability conditions for the joint treatment
$(\bar{A}, \bar{C})$, IP weighting can unbiasedly estimate the joint effect of $(\bar{A}, \bar{C})$ even
when some components of $\bar{L}$ are affected by prior treatment.
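
The statements about the size of the two pseudo-populations can be checked numerically. The Python sketch below assumes a hypothetical study with a single censoring interval, 100 individuals, one binary covariate, and no treatment in the numerator model. The probabilities and names are invented for illustration, and in practice they would come from fitted logistic models.

```python
def nonstabilized_weight(denominators):
    """W^C for an uncensored individual: product of 1 / Pr[C_k = 0 | full history]."""
    w = 1.0
    for p in denominators:
        w *= 1.0 / p
    return w


def stabilized_weight(numerators, denominators):
    """SW^C: product over k of Pr[C_k = 0 | treatment] / Pr[C_k = 0 | full history]."""
    sw = 1.0
    for num, den in zip(numerators, denominators):
        sw *= num / den
    return sw


# Hypothetical example: stratum of L -> (number at risk, Pr[C_1 = 0 | L]).
STRATA = {0: (60, 0.5), 1: (40, 0.25)}

n_total = sum(n for n, _ in STRATA.values())
n_uncensored = sum(n * p for n, p in STRATA.values())    # 30 + 10 individuals
marginal = n_uncensored / n_total                         # Pr[C_1 = 0]

# Pseudo-population sizes: sum of weights over the uncensored individuals.
size_w = sum(n * p * nonstabilized_weight([p]) for n, p in STRATA.values())
size_sw = sum(n * p * stabilized_weight([marginal], [p])
              for n, p in STRATA.values())

if __name__ == "__main__":
    print(n_total, n_uncensored, size_w, size_sw)
```

Under these assumptions the nonstabilized weights rebuild a pseudo-population of the original size (100), whereas the stabilized weights keep the size of the uncensored group (40) and average to 1 among the uncensored.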

Finally, when using g-estimation of structural nested models, we first need
to adjust for selection bias due to censoring by IP weighting. In practice,
this means that we first estimate nonstabilized IP weights $W^{\bar{C}}$ for censoring
to create a pseudo-population in which nobody is censored, and then apply
g-estimation to the pseudo-population.


Chapter 22 
TARGET TRIAL EMULATION 


As discussed in Part I, causal inference from observational data can be viewed as an attempt to emulate a hypo- 
thetical randomized trial, which we refer to as the target trial. Since we now have all the tools that are needed 
to tackle causal inferences with time-varying treatments, this chapter generalizes the concept of the target trial 
to sustained treatment strategies and outlines a unified framework for causal inference, regardless of whether the 
data arose from a randomized experimental or an observational study. 

This chapter also describes a taxonomy of causal effects that may be of interest when emulating a target trial, 
including intention-to-treat and per-protocol effects. Valid estimation of those causal effects generally requires 
data on time-varying prognostic factors and treatments, as well as appropriate adjustment for those time-varying 
factors using g-methods. It is precisely the development of g-methods that makes the concepts discussed here 
something more than a formal exercise: if data are available, the effects of interest can now be validly estimated. 


22.1 The target trial (revisited) 


To fix ideas, consider a randomized trial to estimate the effect of antiretroviral 
therapy on the 5-year risk of death among HIV-positive individuals. Eligible 
participants—over 18 years of age, no AIDS, no previous use of antiretroviral 
therapy—are randomly assigned to either treatment strategy g or treatment 
strategy g’ at the start of follow-up k = 0 (baseline). Their follow-up starts 
at the time of assignment and ends at death, loss to follow-up, or 60 months 
after baseline, whichever occurs earlier. Of course, as in any trial, not all 
participants adhere to the treatment strategies defined in the trial protocol. 
That is, there are deviations from protocol. 

Our trial is a pragmatic trial. In particular, the participants and their 
treating physicians are aware of the treatment they receive (i.e., the treatment 
assignment is not blinded), nobody receives a placebo (i.e., both strategies 
g and g’ involve either active treatments or no treatment), and participants 
are monitored as frequently and intensely as regular patients outside of the 
study. A pragmatic trial is preferable when the goal is quantifying the effects 
of treatment strategies under realistic conditions, including that physicians and 
participants are aware of the care received by the latter. 

If conducting this pragmatic randomized trial were not possible, we may
attempt to emulate it through the analysis of existing observational data. We
then refer to the trial as the target trial for our observational analysis.

See Hernán and Robins (2016) for
more details about the characteristics
of the target trial.

Specifying the protocol of the target trial is a useful device to clarify the
causal question of interest that we wish our observational analysis to answer.
At the very least, we need to specify the following key components of the
protocol: eligibility criteria, start and end of follow-up, treatment strategies,
outcomes of interest, causal contrast, and data analysis plan. Note that a
precise specification of the protocol of the target trial may require some exploration
of the available data. For example, only after having determined that
the data included information on HIV diagnosis can we reasonably propose to
emulate a target trial of HIV-positive individuals.
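
One way to keep these protocol components explicit during an analysis is to record them in a small data structure. The Python sketch below is only a suggestion: the class and field names are hypothetical, the filled-in values paraphrase the HIV example above, and the components that the example has not yet specified are deliberately left empty.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TargetTrialProtocol:
    """Key protocol components of a target trial (field names are illustrative)."""
    eligibility_criteria: str
    start_of_follow_up: str
    end_of_follow_up: str
    treatment_strategies: tuple
    outcomes: str
    causal_contrast: str
    analysis_plan: str

    def unspecified(self):
        """Return the names of components that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


hiv_trial = TargetTrialProtocol(
    eligibility_criteria="HIV-positive, over 18 years of age, no AIDS, "
                         "no previous use of antiretroviral therapy",
    start_of_follow_up="time of assignment (baseline, k = 0)",
    end_of_follow_up="death, loss to follow-up, or 60 months after baseline, "
                     "whichever occurs earlier",
    treatment_strategies=("g", "g'"),
    outcomes="death within 5 years of baseline",
    causal_contrast="",   # not yet specified in the example
    analysis_plan="",     # not yet specified in the example
)

if __name__ == "__main__":
    print(hiv_trial.unspecified())
```

Listing the empty components makes it harder to start an observational analysis before the causal question has been fully stated.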
Technical Point 22.1


Controlled direct effects. Consider the average causal effect of a treatment A on an outcome Y when an intermediate 
variable—or mediator—B is set to a particular value. We refer to this quantity as the direct effect of A on Y not 
through B. If the mediator B could take two values (0 or 1), then we can define the direct effect of A on Y when B is 
set to 1 and the direct effect of A on Y when B is set to 0. On the additive scale, these two direct effects are defined
by the counterfactual differences $E[Y^{a=1,b=1}] - E[Y^{a=0,b=1}]$ and $E[Y^{a=1,b=0}] - E[Y^{a=0,b=0}]$, respectively. These
direct effects, which are often referred to as controlled direct effects, could, in principle, be identified by conducting an
experiment with sequential randomization for both treatment A and mediator B, or by emulating such a target experiment
using observational data. (Technical Point 22.2 describes other types of direct effects for which no target experiment
exists.)

Suppose we conduct a randomized experiment in which participants are randomly assigned at baseline to either 
treatment A = 1 or A = 0 and one month after baseline to either treatment B = 1 or B = 0. Thus all individuals will 
be placed in one of four groups: (A = 1, B = 1), (A=1,B=0), (A=0,B=1), or (A=0,B=0). The outcome 
of interest Y is measured at 3 months in all individuals (for simplicity, suppose no individuals were lost to follow-up or 
died). This study design allows us to consistently estimate the controlled direct effects because the randomization of
both A and B ensures that the counterfactual quantities $E[Y^{a,b}] = \Pr[Y^{a,b} = 1]$ are consistently estimated by the
observed risks $\Pr[Y = 1 \mid A = a, B = b]$.

The controlled direct effects can also be validly estimated in observational studies as long as the identifiability 
conditions of consistency, positivity, and exchangeability hold for both A and B. A precise characterization of these 
identifiability conditions was actually provided in Chapter 19 because a controlled direct effect is just a particular case
of a contrast of treatment strategies sustained over time. To see this, simply replace A and B by $A_0$ and $A_1$ in the above
expressions. More generally, both the treatment A and the mediator B can be time-varying themselves.
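
In the four-arm randomized experiment described above, the two controlled direct effects are differences of observed risks. A minimal Python sketch, with invented risks used purely for illustration:

```python
def controlled_direct_effects(risk):
    """Additive controlled direct effects of A on Y with B set to b.

    risk[(a, b)] is the observed risk Pr[Y = 1 | A = a, B = b] in the arm
    randomized to (A = a, B = b).
    """
    return {b: risk[(1, b)] - risk[(0, b)] for b in (0, 1)}


# Hypothetical observed risks in the four randomized arms.
RISKS = {(1, 1): 0.10, (0, 1): 0.15, (1, 0): 0.20, (0, 0): 0.30}

if __name__ == "__main__":
    cde = controlled_direct_effects(RISKS)
    print("B set to 1:", cde[1])
    print("B set to 0:", cde[0])
```

The two effects need not be equal. When they differ, the direct effect of A depends on the value at which B is held.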














The acronym PICO (Population,
Intervention, Comparator, Outcome)
has been proposed to summarize
some of the components of
the target trial (Richardson et al.
1995).

We introduced the concept of the target trial in Chapter 3. However,
Parts I and II only referred to simplistic target trials that compared time-fixed
treatments. We are now ready to discuss realistic target trials that compare
sustained treatment strategies like $g_1$ “take therapy continuously during the
follow-up, unless a contraindication or toxicity arises” and $g_0$ “refrain from
taking therapy during the follow-up”. The next section defines causal effects
of interest in (real and emulated) randomized trials concerned with sustained
treatment strategies. Additional contrasts of sustained strategies, referred to
as direct effects, are described in Technical Point 22.1.


22.2 Causal effects in randomized trials 


Let us review three types of causal effects that may be of interest in a randomized
trial. To do so, we need some familiar notation. Let $A_k$ take value 1 if the
individual receives therapy at time $k$ and 0 otherwise, and $C_k$ take value 0 if
the individual remains uncensored at time $k$ and 1 otherwise, for $k = 0, 1, 2, \ldots, K$
with $K = 59$. Our trial will assign eligible individuals to either the strategy
$g_1$ “receive treatment $A_k = 1$ continuously during the follow-up unless a contraindication
or toxicity arises” or the strategy $g_0$ “receive treatment $A_k = 0$
continuously during the follow-up”. The assignment indicator $Z$ takes value 1
if the individual is assigned to $g_1$ and 0 if assigned to $g_0$.

In the previous chapters of Part III, we were interested in the causal effect of 
treatment on an outcome Y measured at the end of follow-up. Here we extend 
our description to causal effects on a failure time outcome. That is, the goal of 
Technical Point 22.2


Natural direct effects and principal stratum direct effects. Besides the controlled direct effects described in 
Technical Point 22.1, there exist other definitions of the average direct effect of a treatment A on an outcome Y when 
a potential mediator B is set to a particular value. 

The natural direct effect of A on Y not through B is the average causal effect of A on Y if the value of B had been
set to the value that B would have taken if A had been set to 0, that is, if B had been set to the value $B^{a=0}$ (which is 1
for some individuals and 0 for others). The natural direct effect, defined by the contrast $E[Y^{a=1,B^{a=0}}] - E[Y^{a=0,B^{a=0}}]$,
is a cross-world quantity because it requires considering a counterfactual outcome simultaneously indexed by both $a = 1$
and $a = 0$. Therefore, the natural direct effect cannot be identified in a randomized experiment, not even in principle,
and the magnitude of the natural direct effect estimates from observational data cannot be verified. Despite the scientific
impossibility of confirming these estimates, natural direct effects are often the goal of causal mediation analyses. This is
probably because, under strong assumptions, total treatment effects can be decomposed into natural direct and indirect
effects. Natural direct effects were introduced by Robins and Greenland (1992), who referred to them as pure direct
effects; Pearl (2001) renamed them as natural direct effects. For a review, see the book by VanderWeele (2015).

The principal stratum direct effect of A on Y if the value of B had been set to b is the average causal effect
of A on Y in the subset of the population whose value of B would have been equal to b regardless of the
value of A, that is, in the subset of the population with $B^{a=0} = B^{a=1} = b$. Then the principal stratum direct
effect is defined by the contrast $E[Y^{a=1,b} \mid B^{a=0} = B^{a=1} = b] - E[Y^{a=0,b} \mid B^{a=0} = B^{a=1} = b]$, which is equal to
$E[Y^{a=1} \mid B^{a=0} = B^{a=1} = b] - E[Y^{a=0} \mid B^{a=0} = B^{a=1} = b]$. Note that, unlike the other types of direct effects,
principal stratum direct effects do not involve joint counterfactuals $Y^{a,b}$, just the counterfactuals $Y^{a}$ in a subset of the
population so, in that sense, they are the total (rather than direct) effect of treatment in that subset of the population.
Principal stratum direct effects have little scientific relevance when A affects B in almost all individuals, because then
they apply to the very small subset of the population with $B^{a=0} = B^{a=1}$. In practice, B is often coarsened (typically
into a binary indicator) to increase the size of the principal stratum, but coarsening itself may make the principal stratum
direct effect less scientifically relevant. Principal stratum direct effects were introduced by Robins (1986) and popularized
by Rubin (2004). Frangakis and Rubin (2002) used the concept of principal stratum as a tool to handle competing
events.














our trial is to estimate the causal effect on survival (see Technical Point 22.3).
Let $D_k$ be an indicator for death (1: yes, 0: no) by month $k = 1, 2, \ldots, K+1$.
First, let us consider the effect of assignment to the treatment strategy,
regardless of treatment actually received. This effect, commonly known as the
intention-to-treat effect, is defined by a contrast of the outcome distribution
under the interventions:

• be assigned to strategy $g_1$ at baseline and remain under study until the
end of follow-up

• be assigned to strategy $g_0$ at baseline and remain under study until the
end of follow-up

Chapter 9 introduced the concepts
of intention-to-treat effect and per-protocol
effect for time-fixed treatments.

The intention-to-treat effect at time $k$ can then be expressed as the contrast
of the counterfactual risks of death $\Pr[D_k^{z=1,\bar{c}=\bar{0}} = 1] - \Pr[D_k^{z=0,\bar{c}=\bar{0}} = 1]$
under assignment to strategy $g_1$ ($z = 1$) versus $g_0$ ($z = 0$) if nobody had been
lost to follow-up through time $k$ ($\bar{c} = \bar{0}$).

In some randomized trials, assignment to and initiation of the treatment
strategies occur simultaneously. That is, all individuals assigned to strategy $g_1$
start to receive treatment at time 0, regardless of whether they continue taking
it after baseline, and no individuals assigned to strategy $g_0$ receive treatment at
time 0, regardless of whether they start taking it after baseline. In those cases,
the intention-to-treat effect is not only the effect of assignment but also the effect
of initiation of treatment, e.g., $\Pr[D_k^{a_0=1,\bar{c}=\bar{0}} = 1] - \Pr[D_k^{a_0=0,\bar{c}=\bar{0}} = 1]$.

Ideally, to avoid confusion about
what should or should not be
deemed nonadherence throughout
the follow-up, the protocol
would fully specify the treatment
strategies of interest. Then the
per-protocol effect would be well-defined
(Hernán and Robins, 2017).

The intention-to-treat effect is agnostic about any treatment decisions made 
after baseline, including discontinuation or initiation of the treatments of inter- 
est, use of non-approved concomitant treatments, or any other deviations from 
protocol. This agnosticism implies that the magnitude of the intention-to-treat 
effect may heavily depend on the particular patterns of deviation from protocol 
that occur during the conduct of each trial. Two studies with the same pro- 
tocol but conducted in different settings may have different intention-to-treat 
effect estimates with neither of them being biased. 
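
As a minimal sketch of the intention-to-treat contrast, assume a hypothetical trial with complete follow-up, so that no adjustment for loss to follow-up is needed. The risk difference at month k is then the difference in the proportion dead by month k between the assignment groups, whatever treatment was actually received. The records, the field names `z` and `death_month`, and all numbers are made up for illustration.

```python
# Hypothetical records: z is the assigned strategy (1 for g1, 0 for g0) and
# death_month is the month of death (None if alive at the end of follow-up).
# Complete follow-up is assumed, so nobody is censored.
trial = [
    {"z": 1, "death_month": None},
    {"z": 1, "death_month": 4},
    {"z": 1, "death_month": None},
    {"z": 1, "death_month": None},
    {"z": 0, "death_month": 2},
    {"z": 0, "death_month": None},
    {"z": 0, "death_month": 5},
    {"z": 0, "death_month": None},
]


def risk_by_month(records, z, k):
    """Proportion of individuals assigned to arm z who died by month k."""
    arm = [r for r in records if r["z"] == z]
    deaths = sum(
        1 for r in arm if r["death_month"] is not None and r["death_month"] <= k
    )
    return deaths / len(arm)


def itt_risk_difference(records, k):
    """Risk under assignment to g1 minus risk under assignment to g0 at month k."""
    return risk_by_month(records, 1, k) - risk_by_month(records, 0, k)


itt_at_6 = itt_risk_difference(trial, 6)
```

Because the function conditions only on the assigned arm, any post-baseline deviation from protocol is absorbed into the estimate, which is the agnosticism described above.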

Second, let us consider the effect of receiving the interventions as specified 
in the study protocol. We refer to this effect as the per-protocol effect. The 
per-protocol effect is defined by a contrast of the outcome distribution under 
the interventions: 


• receive treatment strategy $g_1$ continuously between baseline $k = 0$ and
end of follow-up


• receive treatment strategy $g_0$ continuously between baseline $k = 0$ and
end of follow-up


The per-protocol effect at time $k$ can then be expressed as the contrast of the
counterfactual risks of death
$\Pr[D_k^{g_1,\bar{c}=\bar{0}} = 1] - \Pr[D_k^{g_0,\bar{c}=\bar{0}} = 1]$ under full
adherence to strategy $g_1$ versus $g_0$ if nobody had been lost to follow-up through
time $k$ ($\bar{c} = \bar{0}$).

Sensible trial protocols will not mandate that treatment be continued no
matter what happens to the individual. For example, our strategy $g_1$ of continuous
treatment mandates treatment discontinuation when a contraindication
or toxicity arises. That is, the per-protocol effect generally involves the
comparison of dynamic strategies (“do this, if X happens then do this other thing”)
rather than static strategies (“do this, no matter what happens”). Remember
that we already made this point in Fine Point 19.2.

Sometimes the study protocol is not explicit about the dynamic nature of
the treatment strategies. For example, the protocol may simplify the description
of strategy $g_1$ as “receive treatment $A_k = 1$ continuously during the follow-up”
without explicitly stating that the therapy must be discontinued “when
a contraindication or toxicity arises”. This simplified description of strategy
$g_1$ may lead to misunderstandings. Specifically, an individual assigned to $g_1$
who discontinues therapy because of toxicity should not be labeled as someone
who is not adhering to strategy $g_1$. In fact, that person is perfectly adhering
to strategy $g_1$ as (it should have been) stated in the protocol. When doing
otherwise is not an option in the real world, discontinuation of the originally 
assigned treatment or initiation of other treatments cannot possibly be consid- 
ered a deviation from protocol. Because the per-protocol effect is defined by 
a contrast of realistic strategies, it is particularly relevant for causal inference 
research which seeks to provide evidence for decisions in the real world. 
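
The distinction between discontinuation and nonadherence can be made concrete with a small check of whether a treatment history is consistent with the dynamic strategy. This is only a sketch: the function name, the monthly data layout, and the rule that treatment stays discontinued once a contraindication or toxicity has arisen are illustrative assumptions, not part of any actual protocol.

```python
def adheres_to_g1(treatment, toxicity):
    """Check whether a monthly treatment history is consistent with g1.

    treatment[k] is 1 if treated in month k, else 0.
    toxicity[k] is True if a contraindication or toxicity is present in month k.
    Assumed rule: treat in every month unless a contraindication or toxicity
    has arisen, after which treatment must be discontinued.
    """
    toxicity_has_arisen = False
    for a_k, tox_k in zip(treatment, toxicity):
        toxicity_has_arisen = toxicity_has_arisen or tox_k
        required = 0 if toxicity_has_arisen else 1
        if a_k != required:
            return False
    return True


no_tox = [False, False, False, False]
tox_in_month_2 = [False, False, True, False]

continuous_user = adheres_to_g1([1, 1, 1, 1], no_tox)          # adherent
stopped_for_toxicity = adheres_to_g1([1, 1, 0, 0], tox_in_month_2)  # adherent
stopped_without_reason = adheres_to_g1([1, 1, 0, 0], no_tox)   # not adherent
```

Under this rule the individual who stops because of toxicity is classified as adherent, whereas the one who stops for no protocol-specified reason is not.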

In fact, the per-protocol effect is often the implicit target of inference. For 
example, often investigators question the fidelity of the interventions imple- 
mented in the study to the interventions described in the protocol, and say 
that there is “bias”. This language indicates that the investigators are really 
interested in comparing the interventions implemented during the follow-up 
as specified in the protocol (i.e., the per-protocol effect) and not in the ef- 
fect of assignment to the interventions at baseline (i.e., the intention-to-treat 
effect) because nonadherence after baseline cannot possibly bias the effect of 
assignment at baseline. 

Finally, let us consider the effect of receiving interventions other than the
ones specified in the study protocol. Suppose that, while our trial is being
conducted, a consensus started to emerge that strategy $g_0$ “receive treatment
$A_k = 0$ continuously during the follow-up” is inferior to strategy $g_1$. Therefore
some physicians began to recommend initiation of therapy when the clinical
course worsened, e.g., when the CD4 cell count ($L_k$) first dropped below 200
cells/μL. As a result, many individuals in the trial who were assigned to strategy
$g_0$ actually followed the modified strategy $g_0'$ “receive treatment $A_k = 0$
continuously during the follow-up but, after $L_k < 200$, switch to treatment
$A_k = 1$”. The contrast of outcome distributions under the interventions


• receive treatment strategy $g_1$ continuously between baseline $k = 0$ and
end of follow-up


• receive treatment strategy $g_0'$ continuously between baseline $k = 0$ and
end of follow-up


corresponds to neither the intention-to-treat effect nor the original per-protocol
effect. Rather, it is a question about the per-protocol effect in a hypothetical
trial in which individuals are randomized to either strategy $g_1$ or $g_0'$.

This example illustrates how causal effects of interest that do not corre- 
spond to the original per-protocol effect can be conceptualized as per-protocol 
effects in target trials that can be emulated using the randomized trial data. 
Interestingly, if the strategies of interest differ from those in the actual trial, it 
is actually disadvantageous to have all participants in the actual trial adhere 
to the strategies specified in the protocol. Specifically, complete adherence im- 
plies that the trial data cannot be used to emulate a target trial with a different 
protocol (because no individuals followed the protocol of the target trial in the 
actual data). For example, a randomized trial with full adherence in which 
HIV-positive individuals are assigned to different CD4 cell count thresholds 
at which to initiate antiretroviral therapy is of little use to emulate a trial in 
which individuals are assigned to either continuous treatment or no treatment, 
and vice versa. It is precisely the noncompliance that allows us to use the data 
from a given randomized trial to emulate other randomized trials that answer 
different, perhaps more relevant, causal questions. 
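
One way to formalize the modified strategy with a CD4 threshold of 200 cells/μL is as a rule that maps the CD4 history to a treatment value at each time. The expression below is an illustrative shorthand for the verbal description of that strategy, assuming that treatment starts in the same interval in which the count first drops below 200:

```latex
g_0'(\bar{l}_k) =
\begin{cases}
1 & \text{if } \min_{0 \le j \le k} l_j < 200, \\
0 & \text{otherwise.}
\end{cases}
```

Written this way, an individual who never receives treatment is compatible with $g_0'$ only while the CD4 count stays at or above 200. This is why a trial with full adherence to "never treat" contains no one who follows $g_0'$ after the threshold is crossed.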


22.3 Causal effects in observational analyses that emulate a target trial 


The causal effects described above for randomized trials can be analogously 
defined for observational analyses that emulate a target trial. 

The observational analog of the intention-to-treat effect is defined by a 
contrast of the outcome distribution under the hypothetical interventions 


• initiate treatment $A_0 = 1$ at baseline and remain under study until the
end of follow-up


• initiate treatment $A_0 = 0$ at baseline and remain under study until the
end of follow-up


This observational analog of the intention-to-treat effect at time $k$ can then be
expressed as the contrast of the counterfactual risks of death
$\Pr[D_k^{a_0=1,\bar{c}=\bar{0}} = 1] - \Pr[D_k^{a_0=0,\bar{c}=\bar{0}} = 1]$.
That is, it corresponds to the intention-to-treat effect in
a target trial in which assignment to and initiation of the strategies occur
simultaneously.


Technical Point 22.3


Survival analysis with time-varying treatments. Chapter 17 describes g-methods to estimate the effect of point
interventions on failure time outcomes. Chapter 21 describes g-methods to estimate the effect of sustained treatment
strategies on non-failure time outcomes. In practice, we often need to use g-methods to estimate the effect of sustained
strategies on failure time outcomes. To do so, we need to combine the approaches described in Chapters 17 and 21.
Below we sketch two approaches, based on the g-formula and on IP weighting, to estimate the counterfactual risk
$\Pr[D_{k+1}^{\bar{a}} = 1]$ under treatment strategy $\bar{a}$ if sequential exchangeability, positivity, and consistency hold. We assume
no censoring for simplicity.
The risk $\Pr[D_{k+1}^{\bar{a}} = 1]$ is identified by the g-formula

$$\sum_{j=0}^{k} \sum_{\bar{l}_j} \Pr[D_{j+1} = 1 \mid \bar{A}_j = \bar{a}_j, \bar{L}_j = \bar{l}_j, D_j = 0] \prod_{s=0}^{j} \left\{ \Pr[D_s = 0 \mid \bar{A}_{s-1} = \bar{a}_{s-1}, \bar{L}_{s-1} = \bar{l}_{s-1}, D_{s-1} = 0] \, f(l_s \mid \bar{a}_{s-1}, \bar{l}_{s-1}, D_s = 0) \right\}.$$

A plug-in g-formula estimate can then be obtained by fitting models for the discrete-time hazards
$\Pr[D_{k+1} = 1 \mid \bar{A}_k = \bar{a}_k, \bar{L}_k = \bar{l}_k, D_k = 0]$ and for the conditional density $f(l_k \mid \bar{a}_{k-1}, \bar{l}_{k-1}, D_k = 0)$ of the confounders
in $L$ over time. For example, as described in Chapter 17, a pooled logistic model can be used to adequately approximate
the hazards. For details and an application of the method, see Young et al. (2011).

An alternative is to fit a pooled logistic model for the hazards in which each individual receives the time-varying
nonstabilized IP weight

$$W_k^{\bar{A}} = \prod_{m=0}^{k} \frac{1}{f(A_m \mid \bar{A}_{m-1}, \bar{L}_m)}$$

or its corresponding stabilized IP weight at each time $k$. The parameters of that model estimate the parameters of a
marginal structural pooled logistic model (Robins 1998). For details and an application of the method, see Hernán et
al. (2001).
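
The time-varying nonstabilized IP weight of Technical Point 22.3 is a running product of inverse probabilities, and can be sketched as follows. The treatment model that would supply the probabilities is not shown; the function name, the input list, and the numbers are hypothetical.

```python
def nonstabilized_ip_weights(prob_received_treatment):
    """Cumulative nonstabilized IP weights W_0, ..., W_K for one individual.

    prob_received_treatment[m] is f(A_m | past treatment, covariate history)
    evaluated at the treatment the individual actually received at time m
    (e.g., predicted from a fitted treatment model, which is not shown here).
    """
    weights = []
    running_product = 1.0
    for p in prob_received_treatment:
        if not 0.0 < p <= 1.0:
            raise ValueError("positivity requires probabilities in (0, 1]")
        running_product *= 1.0 / p
        weights.append(running_product)
    return weights


# Hypothetical probabilities for one individual over three time points.
example_weights = nonstabilized_ip_weights([0.5, 0.8, 0.25])
```

In a weighted pooled logistic model, the person-time record at time k would carry the k-th element of this list as its weight.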

The observational analog of the intention-to-treat effect differs slightly from 
the intention-to-treat effect in trials in which some individuals assigned to a 
particular strategy may never initiate it. In our example, we would estimate an 
observational analog of the intention-to-treat effect by comparing individuals 
who do and do not initiate antiretroviral therapy at baseline. This observa- 
tional effect differs from the intention-to-treat effect of a target trial in which 
some individuals assigned to $g_1$ do not take any dose of treatment. Yet a hypothetical
intervention on initiation, as opposed to assignment, of treatment 
preserves a key feature of the intention-to-treat effect: interventions are defined 
solely by events occurring at baseline. 

The observational analog of the per-protocol effect is defined identically as 
that for the target trial. In randomized trials we differentiated between the 
original per-protocol effect and the per-protocol effects in alternative target 
trials. In observational studies this difference is unnecessary because, in the 
absence of a pre-specified protocol, each per-protocol effect corresponds to a 
particular target trial. In general, we can only use observational data to emulate
target trials whose intended interventions are actually followed by at least
some individuals in the study. In some settings, however, investigators may be
willing to use modeling, e.g., dose-response structural models, to extrapolate
beyond the interventions that are actually present in the data.

An advantage of defining the causal effects in observational studies in reference
to those in the target trial is that we are then forced to be explicit
about the strategies that are compared. Once we adopt this viewpoint, it is
obvious that certain comparisons cannot be translated into a contrast between
hypothetical interventions and therefore should be avoided, at least when the
goal of the analysis is to help decision makers. Revisit Sections 3.5 and 3.6 if
necessary.

Example: The highly publicized discrepancy between the estimates of
the effect of postmenopausal hormone therapy on heart disease in
observational studies and a randomized trial was partly due to
use of a comparison of “current users” vs. “never users” in the
observational studies (Hernán et al., 2008). This comparison is rarely,
if ever, used in randomized trials because a contrast of “prevalent
users” versus “non-users”, with prevalent user status changing over
the follow-up, does not generally correspond to the contrast of two
interventions. Further, such a contrast may be particularly sensitive
to selection bias.

Another advantage of an explicit definition of causal effects in observational
studies is clarity. As discussed in Fine Point 9.4, there is a widespread view that
the intention-to-treat effect measures the effectiveness of treatment (loosely
defined: the effect of treatment that would be observed under realistic conditions),
whereas the per-protocol effect measures efficacy (loosely defined: the
effect of treatment that would be observed under perfect conditions). This view
is especially problematic when the interest is in sustained treatment strategies: it
is often difficult to argue that a per-protocol effect of sustained strategies in
a realistic setting measures efficacy, or that the intention-to-treat effect in the
presence of uncertainty about the benefits (or harms) of treatment measures
the effectiveness after those benefits (or harms) are proven. As a result, the
labels “effectiveness” and “efficacy” are ambiguous in settings with sustained
strategies over long periods. An explicit definition of the treatment strategies
that define the causal effect of interest is more informative because decision
makers need information about the effect of well-defined causal interventions.


22.4 Time zero


A crucial component of target trial emulation is the determination of the start 
of follow-up, also referred to as baseline or time zero, in the observational 
analysis. Eligibility criteria need to be met at that point but not later; study 
outcomes begin to be counted after that point but not earlier. 

In randomized experiments, the time zero for each individual is the time 
when they are assigned to a treatment strategy while meeting the eligibil- 
ity criteria. For example, in our randomized trial of antiretroviral therapy, 
time zero is the time when the treatment strategies are assigned (the time 
of randomization), which usually occurs shortly before, or at the same time 
as, treatment is initiated. We do not start the follow-up, say, 2 years before 
or after randomization. Starting before randomization would not be reason- 
able because the treatment strategies had yet to be assigned at that time and 
the eligibility criteria had not been defined, much less met; starting follow-up 
after randomization is potentially biased as deaths during the first two years 
of the trial would thereby be excluded from the analysis. If treatment had a 
short-term effect on mortality, it would be missed. Even more problematic, if 
treatment does indeed have a short-term effect, then more susceptible individ- 
uals would have died by year 2 in the group assigned to active treatment but 
not in the other group. This differential proportion of susceptible individuals 
after two years destroys the baseline comparability achieved by randomization 
and opens the door to selection bias. 
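
A small numerical illustration, with made-up numbers, shows how starting follow-up at year 2 breaks the comparability of the arms. It assumes, as a simplification, that only susceptible individuals can die before year 2 and that treatment raises their early mortality. The function name and parameters are hypothetical.

```python
def susceptible_fraction_after(n_total, n_susceptible, early_death_prob):
    """Fraction of susceptible individuals among those still alive at year 2.

    Simplifying assumption: only susceptible individuals can die before
    year 2, each with probability early_death_prob (expected counts used).
    """
    susceptible_alive = n_susceptible * (1 - early_death_prob)
    total_alive = n_total - n_susceptible * early_death_prob
    return susceptible_alive / total_alive


# Both arms are identical at randomization: 1000 people, 200 susceptible.
treated = susceptible_fraction_after(1000, 200, early_death_prob=0.5)
untreated = susceptible_fraction_after(1000, 200, early_death_prob=0.0)
```

With these numbers about 11% of the treated survivors are susceptible compared with 20% of the untreated survivors, so the two groups that enter a follow-up starting at year 2 are no longer exchangeable even though the arms were identical at randomization.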

The same rules regarding time zero apply to observational analyses and
randomized trials, and for the same reasons. Generally, the follow-up in the
observational analysis should start at the time the follow-up would have started
in the target trial. Otherwise the effect estimates may be hard to interpret and
biased because of selection affected by treatment. Nonetheless, how to emulate
the start of follow-up of the target trial is not always obvious. Consider two
main scenarios, depending on how many times the eligibility criteria can be
met throughout an individual’s lifetime:


1. Eligibility criteria can be met at a single time. This is the simplest setting.
Follow-up starts at the only time the eligibility criteria are met. For
example, consider a study to compare immediate initiation of antiretroviral
therapy when the CD4 cell count first drops below 500 cells/μL
versus delayed initiation in HIV-positive individuals. The follow-up of
eligible individuals starts the first time their CD4 cell count drops below
500.


2. Eligibility criteria can be met at multiple times. This is the setting
that often leads to confusion. For example, consider a study to compare
initiation versus no initiation of hormone therapy among postmenopausal
women with no history of chronic disease and no use of hormone therapy
during the previous two years. If a woman meets these eligibility criteria
continuously between age 51 and 65, when should her follow-up start?
At age 51, 52, 53...? In the target trial a woman would be eligible to
be recruited at multiple times during her lifetime, i.e., she has multiple
eligible times.


Fine Point 22.1 describes the handling of strategies that can be initiated
during a grace period after time zero rather than exactly at time zero.


In settings with multiple eligibility times, there are several alternatives to 
choose the time zero of each individual among her eligible times. One could 
choose as time zero: a) the first eligible time, b) a randomly chosen eligible 
time, c) every eligible time, etc. Strategy c) requires emulating multiple nested 
target trials, each of them with a different start of follow-up. The number of 
nested trials depends on the frequency with which data on treatment and 
covariates are collected: 


• If data are collected on a fixed schedule at pre-specified times (e.g., every
two years, as in many epidemiologic cohorts), then emulate a new trial
starting at each pre-specified time.


• If data are collected on a subject-specific schedule (e.g., electronic medical
records), then choose a fixed time unit (e.g., a day, week or month), and
emulate a new trial starting at each time unit.


From a statistical standpoint, strategy c) can be more efficient than the 
previous ones because it uses more of the available data. However, because 
individuals may be included in multiple target trials, appropriate adjustment 
of the variance of the effect estimate is required. This can be achieved, for 
example, via bootstrapping. 
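
Strategy c) can be sketched as an expansion of person-time records into one entry per individual and eligible time, followed by a bootstrap that resamples individuals rather than trial entries. The data layout, field names, and helper functions below are hypothetical, and the outcome model that would be refitted in each bootstrap sample is not shown.

```python
import random


def expand_into_nested_trials(person_time):
    """Create one trial entry per (individual, eligible time).

    person_time: list of dicts with keys
      id       - individual identifier
      time     - index of the data-collection time (e.g., visit or month)
      eligible - True if the eligibility criteria are met at that time
      treated  - 1 if treatment is initiated at that time, else 0
    Returns one dict per emulated trial entry; "trial" is the time zero
    of the nested trial and "arm" is the strategy initiated at that time.
    """
    entries = []
    for row in person_time:
        if row["eligible"]:
            entries.append(
                {"id": row["id"], "trial": row["time"], "arm": row["treated"]}
            )
    return entries


def bootstrap_sample_ids(ids, rng):
    """Resample individuals (not trial entries) with replacement."""
    unique_ids = sorted(set(ids))
    return [rng.choice(unique_ids) for _ in unique_ids]


# One woman, eligible at times 0-2, who initiates therapy at time 2 and is
# therefore no longer eligible (recent hormone use) at time 3.
records = [
    {"id": 1, "time": 0, "eligible": True, "treated": 0},
    {"id": 1, "time": 1, "eligible": True, "treated": 0},
    {"id": 1, "time": 2, "eligible": True, "treated": 1},
    {"id": 1, "time": 3, "eligible": False, "treated": 1},
]
entries = expand_into_nested_trials(records)
resampled = bootstrap_sample_ids([e["id"] for e in entries], random.Random(0))
```

Resampling at the level of the individual keeps all of a person's trial entries together, which is what allows the bootstrap variance to reflect that the same individual contributes to several nested trials.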


22.5 A unified analysis for causal inference


Unifying the causal analysis of randomized and observational studies requires
a common language to describe both types of studies. The concept of the
target trial provides that common language. Aside from baseline randomization,
there are no other necessary differences between analyses of observational
data that emulate a target trial and of true randomized trials. That is, a
randomized trial can be viewed as a follow-up study with baseline randomization
and observational longitudinal data as a follow-up study without baseline
randomization.


Fine Point 22.1


Grace periods. Consider again the study to compare immediate initiation of treatment when CD4 cell count first drops
below 500 cells/μL versus delayed initiation. In the real world, antiretroviral therapy cannot be started exactly on the
same day the CD4 cell count is measured. Depending on the health care system, it may take weeks or months until
the requisite clinical and administrative procedures are completed and patients are adequately informed. Therefore,
investigators need to define a grace period (say, 3 months) after time zero during which initiation is still considered to
be immediate. Otherwise the study would be estimating the effect of strategies that do not occur frequently in reality
or that could not be successfully implemented in practice.

A consequence of using a grace period is that an individual’s observed data are consistent with more than one
strategy for the duration of the grace period. For example, in the above study, the introduction of a 3-month grace
period implies that the interventions are redefined as “initiate therapy within 3 months after CD4 cell count first drops
below 500 cells/μL” versus “initiate therapy more than 3 months after CD4 cell count first drops below 500 cells/μL”.
Therefore an individual who starts therapy in month 3 after baseline has data consistent with both interventions during
months 1 and 2. Had he died during those 2 months, to which arm of the target trial would we have assigned him?
One possibility is to randomly assign him to one of the two arms.

Another possibility is to create two exact copies of this individual (clones) in the data and assign each of the two
clones to a different arm (Cain et al., 2010). Clones are then censored at the time their data stop being consistent with
the arm they were assigned to. For example, if the individual starts therapy in month 3, then the clone assigned to “start
after 3 months” would be censored at that time. The potential bias introduced by this likely informative censoring would
need to be corrected by adjusting for time-varying factors via IP weighting. Importantly, if the individual had died in
month 2, then both clones would have died and therefore the death would have been assigned to both arms. This double
allocation of events prevents the bias that could arise if events occurring during the grace period were systematically
assigned to one of the two arms only.

When using grace periods with cloning and censoring, the intention-to-treat effect cannot be estimated because
almost everyone will contribute a clone to each of the treatment strategies. Because each individual is assigned to all
strategies at baseline, a contrast based on baseline assignment (i.e., an “intention-to-treat analysis”) will compare groups
with essentially identical outcomes. Therefore, analyses with a grace period at baseline are geared towards estimating
some form of per-protocol effect.
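
The cloning and censoring step of Fine Point 22.1 can be sketched as follows for the two strategies “initiate within the grace period” and “initiate after the grace period”. The data layout, the arm labels, and the handling of events that coincide with the censoring month are illustrative assumptions. The IP weighting needed to adjust for the informative censoring is not shown.

```python
GRACE = 3  # assumed length of the grace period, in months


def clone_and_censor(person, grace=GRACE):
    """Return two clones of one individual, one per strategy.

    person: dict with keys id, start_month (month therapy starts, or None)
    and death_month (month of death, or None).
    Each clone gets an event flag that is True only if death occurs before
    or at the time the clone's data stop being consistent with its arm, and
    a censor_month that is None when the clone is never censored.
    """
    start, death = person["start_month"], person["death_month"]
    started_in_grace = start is not None and start <= grace

    # "early" clone: initiate within the grace period. If therapy has not
    # been started by the end of the grace period, consistency ends there.
    early_censor = None if started_in_grace else grace
    # "late" clone: initiate after the grace period. Starting within the
    # grace period ends consistency at the time of initiation.
    late_censor = start if started_in_grace else None

    clones = []
    for arm, censor in (("early", early_censor), ("late", late_censor)):
        event = death is not None and (censor is None or death <= censor)
        clones.append(
            {
                "id": person["id"],
                "arm": arm,
                "event": event,
                "censor_month": None if event else censor,
            }
        )
    return clones


# Dies in month 2 without having started therapy: the death counts in both arms.
died_early = clone_and_censor({"id": 1, "start_month": None, "death_month": 2})
# Starts therapy in month 3: the "late" clone is censored at month 3.
starts_month_3 = clone_and_censor({"id": 2, "start_month": 3, "death_month": 10})
```

The first example reproduces the double allocation of events during the grace period, and the second the censoring of the clone whose arm is contradicted by early initiation.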


In fact, only three things distinguish the data from randomized experiments
and observational studies. In randomized experiments, (i) no baseline
confounding is expected because of randomization, (ii) the randomization
probabilities are known, and (iii) the assignment to a treatment strategy is known
for each individual at baseline. An observational analysis can emulate (i) if
one measures and appropriately adjusts for a sufficient set of covariates, and
(ii) if the model for treatment assignment is correctly specified. Interestingly,
(iii) is not necessary for estimating the per-protocol effect in either randomized
experiments or observational studies because efficient estimators (that
are functions of the sufficient statistic) do not use this information.

Robins (1986) showed that, in a randomized trial, you can delete the
randomization assignment from the dataset and still estimate a valid
per-protocol effect if a sufficient set of confounders was measured.

The similarities between follow-up studies with and without baseline randomization
are increasingly apparent in the health and social sciences as a
growing number of randomized experiments attempt to estimate the effects of
sustained treatment strategies over long periods in real world settings. These
studies are a far cry from the short experiments in highly controlled settings
that put randomized trials at the top of the hierarchy of study designs in the
early days of clinical research. Randomized experiments of sustained treatment
strategies over long periods, with their potential for substantial deviations from
protocol (e.g., imperfect adherence to the assigned strategy, loss to follow-up),
are subject to confounding and selection biases that we have learned to associate
exclusively with observational studies. In particular, when estimating a
per-protocol effect, both randomized trials and observational studies may need
adjustment for time-varying prognostic factors that predict drop-out (selection
bias) and treatment (confounding).

Time-varying confounding in observational studies is a bias with the
same structure as nonrandom noncompliance in randomized trials.

Because baseline randomization cannot ensure exchangeability between
those who are and are not lost to follow-up after randomization, we refer
to a naive intention-to-treat analysis that does not adjust for selection
bias as a “pseudo-intention-to-treat analysis”.

The commonly used form of “per-protocol analysis” is a
“pseudo-intention-to-treat analysis” restricted to the subset of
the population who follow the protocol (the per-protocol population)
with no adjustment for covariates.

In view of these similarities, one might expect that randomized experi- 
ments and observational studies would be analyzed similarly, except for the 
fact that adjustment for baseline confounders is typically necessary in obser- 
vational studies. In practice, however, the typical analysis of randomized ex- 
periments and observational studies differs radically, which is both perplexing 
and, as we argue below, problematic. 

A natural question is whether the “intention-to-treat analysis” and the 
so-called “per-protocol analysis” commonly used in randomized trials validly 
estimate the intention-to-treat effect and per-protocol effect, respectively. In 
general, the answer is no. A typical intention-to-treat analysis compares the 
distribution of outcomes between randomized groups without any form of ad- 
justment for confounding or selection bias. Lack of adjustment for baseline con- 
founding is justified by randomization: the randomized groups are exchange- 
able because they are expected to have the same risk of the outcome if both 
groups had been assigned to the same treatment strategy. No adjustment for 
post-randomization confounding (e.g., due to nonadherence) is required be- 
cause, again, there cannot be post-randomization confounding for the effect of 
baseline assignment. 

However, the strategies that define the intention-to-treat effect require that 
the individuals remain in the study until their outcome variable can be as- 
certained. Thus the intention-to-treat effect estimate may be affected by 
post-randomization selection bias if individuals are differentially lost to
follow-up, and prognostic factors influence, or are associated with, loss to
follow-up. Therefore, valid estimation of the intention-to-treat effect may require an 
“intention-to-treat analysis” adjusted for post-randomization (time-varying) 
prognostic factors to eliminate selection bias from loss to follow-up. When the 
time-varying prognostic factors are affected by prior treatment, an appropriate 
adjustment will require the use of g-methods. For example, in a randomized 
trial of antiretroviral therapy among HIV patients, the probability of drop- 
ping out of the study may be influenced by the onset of symptoms, which is a 
consequence of treatment itself. 

In addition to the primary “intention-to-treat analysis”, many randomized 
trials also report the results of a so-called “per-protocol analysis”. A com- 
monly used form of “per-protocol analysis” —also referred to as “on treatment 
analysis”—only includes individuals who adhered to the instructions speci- 
fied in the study protocol. For point interventions, the analysis includes 
only the subset of trial participants who adhered to their assigned baseline 
intervention—the per-protocol population. Then the analysis compares the 
distribution of outcomes between randomized groups in the per-protocol pop- 
ulation, typically without any form of adjustment for confounding or selection 
bias. For sustained treatment strategies, individuals are censored at the first 
time they deviate from the protocol. That is, the remaining per-protocol pop- 
ulation at each time is the set of individuals that are still adhering to the 
protocol. 
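
The censoring rule used by this common form of “per-protocol analysis” can be sketched for two static strategies, “always treat” and “never treat”. Static strategies are assumed here only to keep the example short; the function name and data layout are hypothetical. The sketch applies no adjustment of any kind.

```python
def per_protocol_population(assigned, treatment_histories):
    """Number of individuals still adhering at each time, by naive censoring.

    assigned[i] is the static strategy assigned to individual i
    (1: always treat, 0: never treat), and treatment_histories[i][k] is
    the treatment actually received by individual i at time k.
    An individual is censored at the first time the received treatment
    differs from the assigned one.
    """
    n_times = len(treatment_histories[0])
    remaining = [0] * n_times
    for z, history in zip(assigned, treatment_histories):
        for k, a_k in enumerate(history):
            if a_k != z:
                break  # censored from time k onwards
            remaining[k] += 1
    return remaining


assigned = [1, 1, 0, 0]
histories = [
    [1, 1, 1],  # adheres throughout
    [1, 0, 0],  # stops treatment at time 1
    [0, 0, 1],  # starts treatment at time 2
    [0, 0, 0],  # adheres throughout
]
remaining_by_time = per_protocol_population(assigned, histories)
```

The shrinking counts show how the per-protocol population is selected on post-randomization behavior, which is the source of the potential bias when the reasons for deviating are also prognostic factors.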

We often refer to a per-protocol analysis that does not even attempt
to adjust for confounding as a naive per-protocol analysis.

This common approach to “per-protocol analysis” is problematic for two
reasons. First, the analysis, like an “intention-to-treat analysis”, needs to 
consider postrandomization selection bias due to differential loss to follow-up. 
Second, by restricting to the per-protocol population, the analysis partly dis- 
regards the randomized groups and therefore the benefits of randomization: 
the subset of individuals who remain on protocol under one strategy may not 
be comparable with the subset on protocol under another strategy. That is, 
a “per-protocol analysis” is akin to an observational analysis. Therefore, like 
any observational analysis, a “per-protocol analysis” needs to consider bias due 
to time-varying prognostic factors that affect the decision to stay on protocol. 
When these postrandomization factors are affected by the interventions of in- 
terest, then g-methods specifically designed to deal with treatment-confounder 
feedback are needed. Instrumental variable estimation (Chapter 16) can some- 
times be used to validly estimate per-protocol effects of point interventions 
without explicit adjustment for postrandomization factors, but the validity of 
these methods depends on having a valid instrument and on strong modeling 
assumptions. Some forms of instrumental variable estimation are a particular 
case of g-estimation (see Technical Point 16.5). 


Analogously to the adjusted analyses for randomized trials, observational
analyses generally need to be adjusted for both baseline and time-varying
prognostic factors using g-methods. The observational analyses are conducted by 
using the above approaches but now applied to the target trial. The goal is to 
estimate the observational analog of the intention-to-treat and the per-protocol 
effect in the target trial. 


In summary, the analysis of randomized trials and observational studies 
should be similar. If we feel compelled to adjust for time-varying confound- 
ing and selection bias in the analysis of observational studies, we should feel 
equally compelled to adjust for post-randomization confounding and selection 
bias in the analysis of randomized trials. The only necessary difference between 
follow-up studies with and without baseline randomization is, precisely, base- 
line randomization. That is, adjustment for baseline confounding will not be 
generally required in intention-to-treat analyses of randomized trials. However, 
adjustment for post-baseline (time-varying) factors will generally be necessary 
for per-protocol analyses of both randomized trials and observational studies. 
A unified approach to causal inference for sustained treatment strategies is 
possible based on the target trial concept and on g-methods. 


Historically, randomized experiments have been considered far superior to 
observational studies for the purpose of making causal inferences and aiding 
decision-making. Unfortunately, randomized experiments are not always avail- 
able because they may be expensive, infeasible, unethical, or simply untimely 
to support an urgent decision. Therefore, as much as we may like random- 
ization, many decisions will need to be made in the absence of evidence from 
randomized trials. When we cannot conduct the randomized experiment that 
would answer our causal question, we resort to observational analyses. It is 
therefore important to use a sound approach to design and analyze observa- 
tional studies. Making the target trial explicit is one step in that direction. 
When the goal is to assist decision making, analyses of existing observational
data need to explicitly emulate a target trial and be evaluated with respect to
how well they emulate it.